400 Commonwealth Drive, Warrendale, PA 15096-0001 U.S.A.
Tel: (724) 776-4841 Fax: (724) 776-5760
SAE TECHNICAL PAPER SERIES

2000-01-1052

Delphi Secured Microcontroller Architecture

Terry L. Fruehling
Delphi Delco Electronics Systems

Reprinted From: Design and Technologies for Automotive Safety-Critical Systems (SP1507)

SAE 2000 World Congress
Detroit, Michigan
March 6-9, 2000

The appearance of this ISSN code at the bottom of this page indicates SAE's consent that copies of the paper may be made for personal or internal use of specific clients. This consent is given on the condition, however, that the copier pay a $7.00 per article copy fee through the Copyright Clearance Center, Inc. Operations Center, 222 Rosewood Drive, Danvers, MA 01923 for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. This consent does not extend to other kinds of copying such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. SAE routinely stocks printed papers for a period of three years following date of publication. Direct your orders to SAE Customer Sales and Satisfaction Department. Quantity reprint rates can be obtained from the Customer Sales and Satisfaction Department. To request permission to reprint a technical paper or permission to use copyrighted SAE publications in other works, contact the SAE Publications Group. No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher.

ISSN 0148-7191
Copyright 2000 Society of Automotive Engineers, Inc.

Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE. The author is solely responsible for the content of the paper. A process is available by which discussions will be printed with the paper if it is published in SAE Transactions. For permission to publish this paper in full or in part, contact the SAE Publications Group.
Persons wishing to submit papers to be considered for presentation or publication through SAE should send the manuscript or a 300 word abstract of a proposed manuscript to: Secretary, Engineering Meetings Board, SAE.

Printed in USA

All SAE papers, standards, and selected books are abstracted and indexed in the Global Mobility Database.

2000-01-1052

Delphi Secured Microcontroller Architecture

Terry L. Fruehling
Delphi Delco Electronics Systems

Copyright 2000 Society of Automotive Engineers, Inc.

ABSTRACT

As electronics take on ever-increasing roles in automotive systems, greater scrutiny will be placed on those electronics that are employed in control systems. X-By-Wire systems, that is, steer- and/or brake-by-wire systems, will control chassis functions without the need for mechanical backup. These systems will have distributed fault-tolerant and fail-safe architectures and may require new standards in communication protocols between nodes (nodes can be considered as communication relay points). At the nodes, the "host" application Electronic Control Unit (ECU) will play a pivotal role in assessing its own viability. The microcontroller architecture proposed in this paper focuses on ensuring thorough detection of hardware faults in the Central Processing Unit (CPU) and related circuits, thus providing a generic fail-silent building block for embedded systems. Embedded controllers that implement the Delphi Secured Microcontroller Architecture will provide high deterministic fault coverage with relatively low complexity.

INTRODUCTION

Many techniques to validate the node or host ECU are presently employed, ranging from software-intensive schemes such as diverse (redundant) algorithms with time redundancy, to hardware-intensive schemes such as asymmetrical microcontrollers or completely redundant multiple-microcontroller architectures. An alternative to these schemes is the Dual Central Processing Unit (DCPU).
This strategy has proven successful for Delphi and has reduced hardware and software design complexity. Additional payoffs include reduced software validation requirements, increased system reliability, and decreased EMI/EMC concerns at the ECU. In the future the Delphi Dual CPU architecture will include a new data surveillance module called the Data Stream Monitor (DSM). The function of this module will be to cover faults in the data streams processed by the Dual CPU, keep fault detection latency to a minimum, and maintain a high level of independence between the application control algorithm and the fail-silent implementation.

Delphi has been a leader in implementing secured data processing techniques in automotive embedded microcontrollers. The Dual CPU made its debut in production ABS programs in 1996. As this paper is written, the system has a proven track record with over 10 million units fielded.

BACKGROUND - THE MCU/CPU SYSTEM

The problem of fault detection in an embedded microcontroller will be discussed in three broad categories: CPU, memory and MCU peripherals. To facilitate discussion and gain insight into how the Delphi Secured Microcontroller Architecture evolved, the microcontroller will be defined as follows: A Microcontroller Unit (MCU) is comprised of a central processing unit (CPU) and associated peripheral devices. The peripheral devices may be general or customized to the controller application. These can include communication devices such as serial peripheral interfaces, as well as timers, auxiliary power supplies, A/D converters and other devices, built on the same integrated circuit. The core of the MCU (CPU core) is the CPU together with the memory it immediately acts on, such as RAM, ROM/FLASH, EEPROM, and the communication bus that links these elements. An MCU dedicated to the control of one vehicle subsystem, such as anti-lock brakes (ABS), is considered to be embedded in that subsystem.
Further, when the MCU is part of an application Electronic Control Unit (such as an ABS ECU) which contains interface circuits supporting specialized I/O requirements, the combination may be referred to as an embedded controller. These distinctions are made to help organize the way the ECU is partitioned and to clarify the proper fault detection methods for components in the subsystem, including external sensors and electromechanical drive elements or internal ECU integrated circuits. It is also done to develop a layered fault detection model for the vehicle system. The layered fault detection organization facilitates a bootstrap sequencing that helps to solve the problem of who checks the checker. Overall, the method minimizes redundancy between vehicle system software diagnostics and ECU system built-in self-test. The layered fault detection model is similar in spirit to layered models found in network communications systems. The bootstrap technique discussed in this paper is continued throughout the ECU interface subsystem circuits.

The correct operation of the CPU, memory and MCU peripherals (such as timer modules, A/D converters, communication, and output driver modules, etc.) must be established not only during the initialization phase following power-on, but also during repetitive execution of the control program. Normally, bootstrap test schemes are run only at power-up. Delphi's unique architecture facilitates continuous fault detection using this method without choking the throughput capacity of the CPU. The generic modules that make up the Delphi Secured Microcontroller are considered fail-silent because the outputs are disabled in the event of a fault. However, backup communication output signals will be produced to flag the vehicle system when a fault has occurred. Figure 1 shows a typical layered model used for fault detection analysis in a simple ABS system.

Figure 1. Typical layered model used for fault detection analysis in a simple ABS system.
Supplemental software techniques are also used to process critical redundant inputs/outputs (I/O) and the specialized BIST circuits of the ICs. In order to limit the scope of this paper, these topics will not be discussed in detail. Further, the fault detection techniques of the peripheral modules built onto the microcontroller, although briefly summarized, cannot be adequately covered in this paper. The focus will be limited to the task of continuously validating the state of health of the CPU core system.

Systems that employ embedded MCUs typically include self-tests to verify the proper operation of the CPU and associated peripheral devices. Typically these tests include illegal memory access decoding, illegal opcode detection, or a simple Watchdog/Computer Operating Properly (COP) test. These strategies alone do not provide complete fault coverage of key MCU components such as the CPU. While other, redundant fault-detection methods frequently employed in automotive systems can detect the presence of hazards caused by faults in these components, it is much easier to guarantee safety when coverage of the CPU is provably complete. It is very difficult to achieve provably complete fault detection in a device as complex as a CPU without duplication and comparison.

Test methods that are implemented so that execution occurs as the algorithm is running will be referred to as "on-line" or "concurrent" testing. Further, "off-line" testing will reference the condition when the device is placed in a special mode in which the execution of the application algorithm is inhibited. Off-line testing is generally reserved for manufacturing test or for special purpose diagnostic test tools used by the field technician. Tens of thousands of test vectors are generated for manufacturing tests to establish a 99% fault detection level for complex microcontrollers.
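For readers unfamiliar with the Watchdog/COP test mentioned above, the following sketch models a windowed COP timer in C. All names and window values are hypothetical, and a real COP is a free-running hardware counter; the sketch only illustrates the principle that the CPU must service the timer neither too late nor too early.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model of a windowed COP (Computer Operating
   Properly) timer: the CPU must service it no sooner than `min` and
   no later than `max` ticks after the previous service, otherwise a
   reset is requested. */
typedef struct {
    uint32_t since_service;   /* ticks since last valid service      */
    uint32_t min, max;        /* legal service window, in ticks      */
    bool     reset_request;   /* latched when the window is violated */
} cop_t;

void cop_init(cop_t *c, uint32_t min, uint32_t max) {
    c->since_service = 0;
    c->min = min;
    c->max = max;
    c->reset_request = false;
}

/* Driven by a hardware tick; trips the COP if the CPU stops servicing. */
void cop_tick(cop_t *c) {
    if (++c->since_service > c->max)
        c->reset_request = true;
}

/* Called by the application; servicing too early also trips the COP,
   which catches runaway code stuck in a tight service loop. */
void cop_service(cop_t *c) {
    if (c->since_service < c->min)
        c->reset_request = true;
    c->since_service = 0;
}
```

As the text notes, a check of this kind confirms only that some code is still running on schedule; it says nothing about whether the CPU is computing correct results.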
Designing routines to test the ability of a CPU to execute various instructions by using sample data and the instructions under test that will be used in the application is impractical. This remains true even if a separate "test ROM" [2] were included in the system to either:

1. Generate a special set of inputs and monitor the capability of the CPU and the application algorithm (or a test algorithm) to respond properly; or
2. Generate and inject test vectors derived from manufacturing fault detection testing and then evaluate the capability of the CPU to properly process them and produce the correct resultant data at circuit-specific observation points.

Figure 2 illustrates the concept of a "test" or "stimulus" ROM and its integration into a fail-silent system.

Figure 2. Concept of a "test" or "stimulus" ROM and its integration into a fail-silent system.

Implementing the first technique will encounter the following technical hurdles. In a complex system a test ROM will become inordinately large in order to adequately guide the CPU through a limited number of paths or "threads" of the application algorithm. The test vectors used must be carefully selected, which requires intimate and detailed knowledge of the control algorithm software. Even if the "application system's" fail-silent test designer could manage the task of ensuring that every module was effectively tested, the end result would be of limited utility when considering the range of parameters that can be involved for any given software module. Thus the first test ROM method would be contrived and limited in its ability to simulate an actual operating environment. If the second technique were employed, unless all of the manufacturing test vectors were used or exhaustive testing using pseudorandom patterns were performed, the resulting coverage would be partial and the tests lengthy.
Any attempt made to identify only the used portion of the MCU in order to target that subset with the proper vectors (to reduce the overall vector quantity) would require detailed scrutiny and modification every time the algorithm changed, so that the appropriate changes are made to the test vector set. This approach requires detailed knowledge of the MCU and can only be accomplished with the active participation of the MCU manufacturer. The technique, although useful for an initial start-up verification, would have implementation difficulties for continuous validation of the system in a dynamic run mode of operation. Neither of the above techniques considers the concept of monitoring a system based on execution "dwell time" in any particular software module or application "run time mode" condition.

Modifying a CPU to include built-in self-test, such as parity to cover the instruction set look-up table, duplication or Total Self Check (TSC) circuit designs, current drain
testing, level and threshold testing, etc., of subcomponents of the CPU, may result in a significant modification to a basic cell design [7]. It was Delphi's goal to develop a fail-silent method that could be implemented on a variety of architectures from different MCU silicon suppliers with a minimum of interaction. Although Delphi uses some of the above techniques, they are employed in "stand-alone" modules, such as wheel speed interfaces, steering drive controls or the functional compare modules. These modules are Delphi-commissioned and vehicle-system specific. They are connected to the MCU bus but do not modify the silicon supplier's core modules. For good reason, CPU designers are reluctant to make even minor changes to proven designs, since experience is the key to confidence in a CPU implementation.

Duplication of the CPU in the Delphi Secured Architecture preserves the knowledge and reliable performance of the basic CPU core. This approach was considered less invasive, and requires less testing and validation, than designs that attempt to modify the CPU core with an assortment of BIST techniques [7]. Further, this technique minimized sharing with the CPU manufacturer the Delphi responsibility for overall system safety. The Dual CPU concept has facilitated implementations on multiple manufacturers' CPU cores and architectures. At this printing, Delphi has Dual CPU implementations on nine different microcontrollers in production.

Software techniques that involve time redundancy, such as calculating the same parameter twice via different (diverse) algorithms, also require that multiple variables be used and assigned to different RAM variables and internal CPU special function registers; i.e., time redundancy requires hardware resource redundancy to be effective.
Because of the substantial amount of CPU execution time needed for redundancy, the CPU processing capacity must be doubled to accomplish the redundant calculations in a real-time control application. Because of the added complexity necessary for this implementation of redundancy, the verification process is commonly lengthy, costly and prone to human errors and omissions. Software diagnostics should be devoted to identifying improper behavior in the overall system, not to testing microcontroller hardware.

SELF-TESTING APPROACHES

The following sections examine four broad categories of self-testing concepts: Dual Microcontroller, Asymmetric Microcontroller, Memory Verification and the Dual CPU implementation.

DUAL MICROCONTROLLER CONCEPTS

Having a logic function, or any device, test itself is a questionable practice. In the Delphi system the CPUs are duplicated and hence test each other. As noted, the process of requiring the CPU to perform its own self-testing on all MCU supporting peripherals is inefficient. This is especially true in applications having a relatively large memory and many complex peripheral devices. To date, the most direct way to solve this problem has been to simply place two microcontrollers into the ECU system. In such systems, each microcontroller is the complement of the other and each memory and peripheral module is duplicated. Both devices execute the same code in near lock step. Figure 3 shows a mechanization of a dual microcontroller. The illustration is provided to show the quantity of parallel input/output signals required for its implementation.

Figure 3. Mechanization of a dual microcontroller.

Dual microcontrollers are effective because they check the operation of independent microcontrollers against each other.
Although the system tests are performed with varied threads through the algorithm, and the technique accounts for variable dwell in any portion of the application with the random-like data that occurs in the actual application environment, the following must be considered:

1. Data faults or hardware faults that may occur are used to calculate system parameters. In a dual microcontroller system these parameters will be processed, and may be filtered, before they are compared by the second microcontroller. Since the direct data is not compared, as in the Dual CPU system, the comparisons are considered second order, and the source and nature of the fault could be masked, prone to communication delay, or missed altogether.
2. Many parameters will have to be checked at different rates. Also, the tolerance ranges used to check parameters between the two microcontrollers will be looser than in a direct-data-comparison, "first-time-fail" system of the Delphi architecture type.
3. The number of miscompares between the two MCUs that must occur before a fault is actually logged and responded to must be established. Further, the conditions under which the fault counter will be restarted must be defined.
4. The fail-safe software is not independent from the application algorithm. As adding parameters modifies the application algorithm, fail-safe software alteration must also be evaluated.
5. Some parameters are used to calculate other dependent parameters. This could lead the system architect to make value judgements on which parameters are determined critical or not. This increases the subjectivity of the system and clouds the true error detection capability and latency.
6. This technique is not an efficient form of resource allocation [4,5,8]. Two identical, fully equipped microcontrollers doing the same task is costly. Further, extensive communication software is also used to synchronize the data exchange between the two microcontrollers.
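The fault-counter behavior raised in item 3 can be sketched in C. The threshold, names, and restart-on-agreement policy below are illustrative assumptions, not Delphi's production values; the point is that a dual-microcontroller scheme must make these choices explicitly.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative miscompare "debounce" counter: a fault is logged only
   after `threshold` consecutive miscompares, and the count restarts
   whenever the two controllers agree. Threshold and restart policy
   are hypothetical design parameters. */
typedef struct {
    uint8_t count;
    uint8_t threshold;
    bool    fault_logged;
} miscompare_t;

void mc_init(miscompare_t *m, uint8_t threshold) {
    m->count = 0;
    m->threshold = threshold;
    m->fault_logged = false;
}

/* Feed one cross-check result; returns true once a fault is logged. */
bool mc_update(miscompare_t *m, bool values_agree) {
    if (values_agree) {
        m->count = 0;             /* agreement restarts the counter  */
    } else if (++m->count >= m->threshold) {
        m->fault_logged = true;   /* latched: respond / go fail-silent */
    }
    return m->fault_logged;
}
```

Note how a transient disagreement is silently forgiven; this is exactly the detection-latency and masking concern that the first-time-fail Dual CPU comparison avoids.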
ASYMMETRIC MICROCONTROLLER ARCHITECTURES

There are many hybrid schemes in this category. The secondary processor's size and speed requirements can vary dramatically depending on the extent and variety of the validation tasks it is assigned to perform on the main processor. Much of the appeal of this implementation lies in the ability to use standard "off the shelf" components, and if optimized, it can gain a cost advantage over dual microcontroller architectures. The secondary processor can be used to do an intensive check of a few portions of the algorithm, or to employ check-point and audit schemes of many modules within the control algorithm (or some of both).

Control flow analysis of the main controller is also a popular use of the secondary processor. Control flow designs can also have a great diversity in the complexity of the final implementation. The basic concept is to validate that the main processor executes code from one module to the next in a logical manner. By sending Software Module Identifiers (SMIDs) to the secondary processor, the overall program flow of the main processor can be monitored. The module SMIDs give the secondary processor the capability to determine variations in loop or software module execution time. The ability of the main processor to transfer to, and return from, subroutines properly can also be ascertained. These schemes can give an indication of the state of health of the main processor, but the actual deterministic fault coverage is deceptive and difficult to test and measure (the author has personal experience with placing these schemes in production).

The asymmetrical approach also finds use in conjunction with the software technique that employs diverse programming with time redundancy. In this implementation the main processor is large and fast enough to accommodate two algorithms which process critical input and output variables in two separate ways.
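The SMID-based control-flow check described above can be sketched in miniature. The module table, identifiers, and strictly cyclic expected flow are hypothetical simplifications; real schemes also track execution-time windows per module.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of SMID control-flow monitoring: the main processor reports
   a Software Module Identifier on entry to each module, and the
   secondary processor checks the sequence against the expected
   program flow. Modules and IDs below are illustrative. */
enum { SMID_INPUT = 1, SMID_FILTER = 2, SMID_CONTROL = 3, SMID_OUTPUT = 4 };

static const uint8_t expected_flow[] = {
    SMID_INPUT, SMID_FILTER, SMID_CONTROL, SMID_OUTPUT
};
#define FLOW_LEN (sizeof expected_flow / sizeof expected_flow[0])

typedef struct {
    size_t next;        /* index of the SMID expected next       */
    bool   flow_fault;  /* latched on any out-of-sequence module */
} flow_monitor_t;

void flow_init(flow_monitor_t *f) {
    f->next = 0;
    f->flow_fault = false;
}

/* Conceptually executed by the secondary processor for each SMID
   received from the main processor. */
void flow_check(flow_monitor_t *f, uint8_t smid) {
    if (smid != expected_flow[f->next])
        f->flow_fault = true;             /* skipped module or wild jump */
    f->next = (f->next + 1) % FLOW_LEN;   /* wrap at end of control loop */
}
```

A monitor of this kind confirms only that module entries occur in a plausible order; as the text cautions, it says nothing about whether the data computed inside each module is correct, which is why its deterministic fault coverage is hard to quantify.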
An attempt is made to create as much of a dual channel on the same main processor as possible. This is accomplished by using as many different resources of the microprocessor hardware as possible. Data to be processed will be held in different RAM locations. If possible, the processor will use different internal registers for data manipulation. The system may also use complementary RAM variables to perform and check RAM parameter calculations. A mechanization showing the implementation of two redundant (diverse) programs on the main microcontroller is presented in Figure 4. The illustration also depicts the secondary processor that tests the capability of the main CPU to execute the control program with periodic contrived input data.

Depending on the system, the data acted on by the two algorithms may be exactly the same, in which case the results should closely match. Some schemes may allow the data to be slightly different, in which case the compared results would have to be bounded to create reasonable limits on this data differential. These requirements of software and hardware will increase the complexities of the final design. Finally, both algorithms have to eventually be processed by the same logic unit of the microcontroller. If the logic unit is corrupt, then both diverse algorithms could calculate the same corrupt result. To circumvent this, a second smaller processor is used to send data to be processed by the main controller.

Figure 4. Two redundant (diverse) programs on the main microcontroller, with a secondary test processor.

The main microcontroller processes this data and sends the result back to the second processor for comparison. This is an attempt to test a part of the main processor that cannot be duplicated. In a runtime application, the control algorithm can run many times without executing certain software module functions. The special test injected by the secondary processor also serves to ensure that all modules can be executed and tested on a scheduled basis.
A similarity can be drawn between the second processor in this implementation and the test ROM technique mentioned earlier. Hence this process suffers from the same flaws when contrived and limited data is used to test a microcontroller.

MEMORY VERIFICATION

Single Bit Parity

A common technique for verifying the operation of an MCU memory peripheral is to use a checksum, where a process arithmetically sums the contents of a block of memory. The checksum is then compared to a reference value. A miscompare represents a memory failure. One disadvantage of checksums is that if two opposing bits of the memory are flipped to the opposite state, the checksum will continue to be valid. The failure to correctly detect such data faults is known as aliasing. This fault type is rare and requires that two faults occur exactly or almost simultaneously. Checksums are slow, since they are usually performed by the CPU during time slices allocated for the purpose. Due to increasingly large memory arrays and heavy demands on CPU resources, the validation may not occur within the time responses of the system.

Another technique for verifying the operation of MCU memory peripherals is to use parity. Single bit parity is faster than the checksum method described above, and it synchronizes the memory validation with its use in the execution of the application algorithm. However, it requires the memory array design to be modified and decoding by special hardware. This modification, although not difficult, must be agreed upon with the silicon supplier. The Delphi system takes advantage of parity circuits on small memory arrays such as RAM. At the present time this drives the development of a custom module. With large memory arrays, parity can become a significant cost burden. The CPU and specific software must process the consequences of a parity fault.
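The checksum technique and its aliasing weakness can both be seen in a few lines of C. The function name and word size are illustrative; the aliasing case is visible directly, since flipping a bit down in one byte and the same bit up in another leaves the sum unchanged.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal additive checksum over a block of memory. A miscompare
   against a stored reference value indicates a memory failure, but
   compensating bit flips alias to the same sum. */
uint16_t mem_checksum(const uint8_t *mem, size_t len) {
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += mem[i];   /* wraps modulo 2^16 */
    return sum;
}
```

For example, the blocks {0x01, 0x00} and {0x00, 0x01} produce the same checksum even though two bits differ, which is exactly the aliasing fault class described above.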
Since single bit parity is insensitive to double bit flips, multiple-bit schemes and implementations have been developed to avoid aliasing.

Delphi Memory Fail-Silent Implementation

The Delphi system exploits a concept of minimal redesign of the silicon provider's core element cells such as the CPU and memory. The surveillance module described in this text attaches to any CPU system bus (Harvard or von Neumann, with cache or superscalar implementations) and performs the same validation functions on any architecture. This gives Delphi the advantage of developing equivalent fail-silent systems with a variety of manufacturers. Another advantage of the Delphi system described in this text is the capability to automatically capture and store the location of the fault and what the CPU was executing at the time of the fault.

Error Detect and Correct Modules

To circumvent the requirement of adding special hardware to the CPU or software to the application, multiple-bit parity schemes and standalone Error Detect and Correct (EDC) processor modules have been developed. The problem of modifying the memory array, or adding another memory array to include the extra parity bits, still exists. In a typical application, six bits are added to a 16-bit word, or four bits to an eight-bit word [10]. Consequently, and depending on the implementation, up to 50% of the memory, along with the extra internal module interconnections, may be devoted to the problem of capturing flipped bits. Silicon providers have typically incorporated these circuits because of existing problems with the reliability of FLASH. Using syndrome testing and Hamming codes [9,10], EDC can detect and correct single-bit errors, detect all two-bit errors, and detect some triple-bit errors. Although EDC is adequate to detect flipped-bit errors, the module is intrusive and must be placed in series between the memory module and the CPU.
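The syndrome decoding behind such EDC modules can be illustrated with a small single-error-correcting Hamming code for 8 data bits. The bit layout (codeword positions 1 through 12, data in the non-power-of-two positions, parity at positions 1, 2, 4 and 8) is chosen for clarity and is not the layout of any particular EDC module; the key property is that the XOR of the positions of all set bits is zero in a valid codeword, so a nonzero syndrome names the flipped bit.

```c
#include <stdint.h>

/* Hamming single-error-correcting code for 8 data bits, sketched to
   illustrate syndrome decoding. Data occupies the non-power-of-two
   codeword positions; parity occupies positions 1, 2, 4 and 8. */
static const int data_pos[8] = { 3, 5, 6, 7, 9, 10, 11, 12 };

/* XOR of the positions of all set bits; zero for a valid codeword. */
static int syndrome(uint16_t cw) {
    int syn = 0;
    for (int p = 1; p <= 12; p++)
        if ((cw >> p) & 1)
            syn ^= p;
    return syn;
}

uint16_t hamming_encode(uint8_t d) {
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)
        if ((d >> i) & 1)
            cw |= (uint16_t)1 << data_pos[i];
    int syn = syndrome(cw);
    for (int b = 0; b < 4; b++)           /* set parity bits so that  */
        if ((syn >> b) & 1)               /* the final syndrome is 0  */
            cw |= (uint16_t)1 << (1 << b);
    return cw;
}

uint8_t hamming_decode(uint16_t cw) {
    int syn = syndrome(cw);
    if (syn)
        cw ^= (uint16_t)1 << syn;         /* syndrome names the bad bit */
    uint8_t d = 0;
    for (int i = 0; i < 8; i++)
        if ((cw >> data_pos[i]) & 1)
            d |= (uint8_t)(1 << i);
    return d;
}
```

Adding one overall parity bit to this layout yields the SEC-DED behavior cited above: single-bit errors corrected, all double-bit errors detected.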
All data is channeled through this device for processing before it is sent to the CPU, potentially adding an execution delay to the system on every memory read. While this may be acceptable to some silicon suppliers that need this module to cover for inherent processing problems, EDC approaches must be modified to meet Delphi's future fail-silent goals.

Extended Delphi Memory Fail-Silent Goals

The Delphi system goals are to recapture these memory resources for the application program and to provide the system with more automatic fault response, independent of the state of the CPU. The Delphi system also includes special registers to automatically capture diagnostic information, and continuous safeguards ensuring the correct operation of the surveillance module itself. Although the differences may appear subtle, EDC modules require configuration and driver software. When a fault occurs, a flag or interrupt is generated and the CPU must respond. A fault detected in the Delphi system is responded to by the surveillance module itself, requiring no input from the CPU. If the CPU is healthy it can then read the flags and process the interrupts. In certain conditions the CPU will be prohibited from clearing the fault condition until a reset occurs and a complete diagnostic routine is run.

DEVELOPMENT OF THE DUAL CPU

Providing a second microcontroller operating in parallel with the first is not software and hardware resource efficient [6,8]. The human error involved in software verification, or in ensuring all critical parameters are included and checked, was deemed unacceptable. This led Delphi to develop a dual CPU system incorporated into a single Microcontroller Unit (MCU). In such a system each CPU receives the same data stream from a common memory. The purpose of the secondary CPU is to provide a clock-cycle-by-clock-cycle check of the primary CPU in a functional comparison module.
If the data from the memory is corrupt, it will be discovered at a later step in the validation process. Figure 5 is included to show the simplification in the system mechanization for an ABS controller when a Dual CPU is employed.

Figure 5. Simplified system mechanization for an ABS controller employing a Dual CPU.

To ensure that the CPUs are healthy, both CPUs must respond to the same data in the same way. The Dual CPU system employs continuous cross-functional testing of the two CPUs as multiple paths are taken through the application algorithm. It should be noted that if the system dwells in one software module or mode disproportionately to others, the testing by the Dual CPU is similarly proportionate. Further, the random-like parameter data inherent in real-world applications is operated on by the algorithm, and any inappropriate interaction with the current instruction data stream is detected. This technique has proved effective for all environmental conditions such as temperature, voltage or electromagnetic interference (EMI). In essence, the actual algorithm and data execution become the test vectors used to ensure critical functionality of the system. This is a corollary to common test methods that are designed to detect critical faults. The system tests only those hardware resources the software application algorithm utilizes, and does not spend any time testing unused portions of the CPU system.

Figure 6 illustrates an expanded view of the Dual CPU. Although both CPUs receive the same inputs from the MCU peripherals, the second CPU's only output is to the Functional Compare Module. If the algorithm is modified to include a previously unused set of available instructions (such as a possible fuzzy logic instruction set), or new operational modes are added (such as Adaptive Braking or Vehicle Yaw Control), modification of the self-check system is not required.

Figure 6. Expanded view of the Dual CPU.

The Dual CPU fail-silent system architecture is inherently independent of the application algorithm.
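The behavior of the functional comparison can be modeled in software, although the actual compare module is hardware on the MCU die and the type and field names below are illustrative. The model captures the two defining properties: a fault latches on the first miscompare, and the outputs are then silenced.

```c
#include <stdint.h>
#include <stdbool.h>

/* Conceptual model of the Dual CPU functional compare: both CPUs see
   the same instruction/data stream; the secondary CPU's outputs go
   only to the comparator, which latches a fault and silences the
   external outputs on the first mismatch (fail-silent). */
typedef struct {
    bool fault_latched;   /* set on first miscompare, cleared by reset */
    bool outputs_enabled; /* dropped permanently when a fault latches  */
} compare_module_t;

void cmp_init(compare_module_t *c) {
    c->fault_latched = false;
    c->outputs_enabled = true;
}

/* Called once per clock with the bus values driven by each CPU.
   Returns the value driven externally, or silence after a fault. */
uint16_t cmp_clock(compare_module_t *c, uint16_t primary, uint16_t secondary) {
    if (primary != secondary) {
        c->fault_latched = true;    /* first-time-fail detection     */
        c->outputs_enabled = false; /* disable outputs: fail silent  */
    }
    return c->outputs_enabled ? primary : 0;
}
```

Note that once the fault latches, later agreement between the CPUs does not re-enable the outputs; per the text, the CPU may be prohibited from clearing the condition until a reset and full diagnostic routine have run.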
Also, the primary design intent of a Dual CPU system is to respond to a fault on its first occurrence.

The Delphi Secured Architecture - Dual CPU Data Comparisons versus Dual Microcontroller

Both the Dual CPU and the Dual Microcontroller architectures compare similar data values to determine the existence of a fault. There is, however, a significant difference in the quality of the values compared. For example, in the Dual CPU the data compared will be the actual input that is collected from a pair of similar or redundant peripheral interface modules. Consider the present application, where two similar or redundant peripheral modules, such as two different input capture timer modules (i.e., the MCU core timer vs. the timers in the Delphi application-specific Wheel Speed Input Module) with different time bases, or two different A/D converters, capture the same event. The data will only deviate by the quantization, linearity, or offset error differences between the two inputs. Once these limits are set, the Dual CPU can detect and respond to a first-time fail. The quality of the data is improved because it is local and directly processed by the same CPU. This enables confidence in first-fault error detection. Conversely, a Dual Microcontroller algorithmically processes or filters the data by software, and also has to account for communication delay before the data is compared. This means that the Dual CPU provides immediate first-order detection of data discrepancies, whereas the dual MCU suffers from second-order, less accurate error detection.

The Dual CPU - Summary of Key Points

Thus the architectural design simplifications achieved by the Dual CPU, and the support modules that are either locally duplicated or have their own uniquely designed BIST circuits, gain several advantages over any of the competing alternatives:

- Increased hardware reliability due to the reduced component and interconnect count.
- Decreasing that part of EMI susceptibility and radiated emissions that is related to the smaller and less complex board layout.
- Improved diagnostics because:
  - The fault is detected at its source, without processing.
  - The system detects faults on their first occurrence.
- Increasing the software reliability by:
  - Elimination of communications and data synchronization software.
  - Significant reduction in parametric comparisons and the inherent software decision logic.
  - Reduced complexity of software validation.

DELPHI SECURED ARCHITECTURE DESIGN PHILOSOPHY

The development of the Delphi Secured Microcontroller Architecture is based on the following axioms:
- The Microcontroller's single CPU is insufficient to adequately determine its own functional integrity.
- Once the CPU's functional integrity is verified, the CPU is then sufficient to verify the functionality of the microcontroller's peripherals, provided:
  - Appropriate diagnostics are incorporated in the peripheral modules.
  - Redundant input and output signals exist for plausibility checks of critical signals.
  - Appropriate feedback signals exist for plausibility checks of critical signals.
- All peripheral modules will be verified at startup and as the control algorithm executes.
- Hardware implementations are preferred for fail-silent architectures for the following reasons:
  - They promote fail-silent independence from the application software.
  - Hardware implementations put the diagnostic at the point where the fault can be detected the earliest.
  - Hardware fail-silent schemes can be implemented to minimize or eliminate competition for limited CPU and memory resources.
  - More complete testing and validation can be achieved using manufacturing processes.

The Delphi system is a bootstrap process that depends on verifying the CPU first and then the MCU peripheral modules. The process is run during the initialization phase and during repetitive execution of the control program.
It is therefore advantageous to the execution speed of this method to incorporate peripheral BIST circuits that are independent of, and require minimal interaction with, the CPU. The Delphi architecture accomplishes this in the following manner:
- The secondary processor / functional compare module runs concurrently with the main control processor, and consumes no system resources until a fault is detected. There is a software module that performs initial CPU configuration (such as clearing internal registers and testing the functional compare module) and handles fault diagnostics; however, this code does not execute concurrently with the application software.
- The DSM runs concurrently and autonomously in background mode and in the start-up initialization mode. There is configuration and test software that checks the DSM and handles faults, but this code does not execute concurrently with the application software. When the foreground DSM operates for dynamic verification, there is a slight impact on CPU resources.

The Delphi system takes advantage of the continuously varying execution threads through the application code, and the random-like data that occurs in actual use, to detect faults. A benefit of the Delphi Architecture is that the real-time CPU and software execution testing is automatically proportionate to the time the system dwells in any mode. In actual use, the control program can run many times without going through every possible code path. When a particular thread through the algorithm inevitably does execute, the Delphi Architecture provides the following safeguards:
- The Dual CPU serves as a runtime functional check on the processing of code, data, and output controls as it executes.
- The Data Stream Monitor (DSM) ensures that the code and data signatures presented to the Dual CPU at runtime match the code and data signatures that were generated when the code was compiled.
The objectives of the Delphi architecture regarding MCU peripherals are to:
- Ensure correct initial and continuous MCU system peripheral configuration.
- Incorporate sufficient diagnostics (HW or SW) to adequately cover both the CPU and MCU peripherals to ensure proper critical functionality of the system. Adequate coverage means the CPU and support peripherals will be diagnosed for failure within the time response of the vehicle, to ensure occupant safety. Critical functionality means that the Delphi architecture focuses on the MCU resources that are used (at runtime and when the system was developed). Those resources that are not used, or were never used when the system was developed, are not tested.

DETAILED COMPONENTS OF THE DELPHI SECURED MICROCONTROLLER

The following is a summary of the system that supports the mission of the Dual CPU in a mission-critical embedded controller:
- One main control CPU: controls the MCU system peripherals.
- One secondary CPU: operates in lockstep with the main CPU; it receives all the same inputs as the main CPU, but its only output is to the Functional Compare Module.
- One Functional Compare Module: compares the Address, Data, and Control outputs of the main and secondary CPUs. If a fault occurs, the ECU system outputs are disabled. In the Delphi system the CPU stays active to aid in diagnostics.
- One Data Stream Monitor (DSM): a memory-mapped module designed for autonomous and concurrent background testing of memory. In addition, this module is capable of signaturing data streams while the CPU is on the bus.
- Parity on RAM.
- Duplication of selected peripheral modules.
- Secondary clock oscillator and error detection circuits.

Note: The mechanization that follows assumes the usual complement of self-test functions included with modern microcontrollers. Typically these functions include illegal memory access decoding, illegal opcode detection, and simple Watchdog / Computer Operating Properly (COP) circuits.
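As a behavioral illustration only, the cycle-by-cycle comparison performed by the Functional Compare Module might be modeled as below. All type and field names are hypothetical, and the real module is dedicated hardware, not software; the sketch only shows the fail-silent logic: mismatch latches a first fault, disables outputs, and captures the MCU state for diagnostics.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of one bus cycle as seen by the compare logic. */
typedef struct {
    uint32_t addr;   /* address bus  */
    uint16_t data;   /* data bus     */
    uint8_t  ctrl;   /* control bus  */
} bus_cycle_t;

typedef struct {
    bool        fault_latched;
    bus_cycle_t main_at_fault;   /* state captured at the fault event */
    bus_cycle_t shadow_at_fault;
    bool        outputs_enabled; /* ECU output drivers */
} compare_module_t;

static void fcm_compare(compare_module_t *fcm,
                        const bus_cycle_t *main_cpu,
                        const bus_cycle_t *shadow_cpu)
{
    if (fcm->fault_latched)
        return;                          /* hold the first-fault state */
    if (main_cpu->addr != shadow_cpu->addr ||
        main_cpu->data != shadow_cpu->data ||
        main_cpu->ctrl != shadow_cpu->ctrl) {
        fcm->fault_latched   = true;
        fcm->main_at_fault   = *main_cpu;   /* latch both views */
        fcm->shadow_at_fault = *shadow_cpu;
        fcm->outputs_enabled = false;       /* fail silent */
    }
}
```

Note the asymmetry described in the paper: disabling outputs is automatic, while the latched registers are left for the still-active CPU to read during diagnostics.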
Figure 7 is included to show the enhanced diagnostic capabilities of the Dual CPU, which can detect faults and latch the status of the MCU at the time of the fault event. Figure 7.

General System Objectives
- To ensure that the microcontroller is operating as intended. "Operating as intended" is defined as the ability of the CPU and associated support peripherals of the MCU to correctly process input data and output controls as required by the application algorithm.
- To ensure data execution coherency. In this context, coherency is defined as stable data, or the absence of flipped bits, stuck bits, transients and noise, or any intermittent inconsistencies in the data stream.
- To increase the deterministic fault coverage of the Delphi fail-silent system architecture. [1,4,5,7]
- To detect and respond to faults within the time response of the system.
- To minimize the fail-silent implementation's dependency on the application algorithm. The fail-silent system is intended to be independent of the application control algorithm. The health of the MCU system is verified before the application algorithm is started. The MCU system is then verified concurrently as the algorithm executes.
- To reduce sensitivity and ensure integrity of the complete MCU system during all forms of environmental stress (electromagnetic fields, RFI, transient noise, thermal cycling/shock, etc.).
- To increase system reliability by decreasing component count and interconnections, and by simplifying the fail-silent software.

Development of the Data Stream Monitor

In the Dual CPU concept, successful testing of peripheral modules by the main CPU is predicated on its own correct state of health (i.e., the ability of the CPU to execute the algorithm as intended) and on the Built-In Self Test (BIST) circuits incorporated into the MCU peripheral modules. The job of the secondary CPU / Functional Compare Module is to guarantee the correct state of health of the main CPU.
Then, as a secondary step, the main CPU methodically tests its subordinate peripherals by exercising or polling their unique BIST circuits or by comparing data from redundant modules.

The Delphi Layered Model for the Bootstrap Fail-Silent Method

This sequential scheme, which first validates the CPU and then validates the MCU peripheral modules in a prescribed order, can be considered a bootstrap validation system. Figure 8 shows a layered model in a simplified steering assist system. It is the same technique as used for ABS systems (Figure 1). Figure 8. The order and priority in which MCU peripheral modules or ECU subsystem circuits / ICs are validated depend on their hierarchical location within the layered model for the system. Because of the sequential nature of the bootstrap method, and since this scheme is run at the initialization phase and during repetitive execution of the control program, the speed at which the CPU can detect faults in the MCU support peripherals is essential. It is advantageous to the execution speed of this method to incorporate peripheral BIST circuits that are independent of, and require minimal interaction with, the CPU.

The Data Stream Monitor Mission

The Data Stream Monitor (DSM) was devised to comply with the above goals. It is a stand-alone memory-mapped module and requires no redesign of the CPU or memory peripheral modules to incorporate it onto the MCU system bus. Figure 9 shows the subsystem of the DSM that is responsible for validating the memory concurrently as the application executes, by using CPU idle bus cycles or by stealing a cycle if needed. All memory blocks are automatically clocked into the system. The subsystem is independent of the state of health of the Dual CPU system. If a fault occurs, all internal registers are latched for enhanced diagnostics. Figure 9. The module is an adaptation of a Linear Feedback Shift Register (LFSR) designed to accept and accumulate parallel inputs from the data bus.
This implementation is commonly referred to as a Parallel Signature Analyzer (PSA). When implemented properly [1,3,8] for the application, the PSA is capable of accomplishing a form of data compression on extremely long data streams. The result of the data compression, referred to as the signature, is held in a register where comparison to a reference value can be made for fault determination. In one mode the DSM can take advantage of, or steal, idle bus cycles from the CPU. This is referred to as the background mode because the CPU is not driving the bus. During these cycles the DSM has the capability of autonomously downloading the contents of memory onto the system data bus. Each word of memory can be accumulated in the PSA in one clock cycle, enabling high-speed signaturing of memory. Unlike EDC processors that are inserted between the memory module and the CPU, the DSM is a bus-listening device and is therefore non-intrusive and easier to implement. As a result of the polynomial divisions that generate the final signature, the probability of aliasing is virtually eliminated [1,3,8]. In the autonomous mode the DSM can verify memory at startup and concurrently as the algorithm executes, independent of the CPU or the CPU's state of health. The DSM in this mode represents a complete hardware implementation, and software support is not required. When a fault does occur, ECU output drivers (relays, solenoids, etc.) are automatically disabled via a fault pin connected to the DSM fault logic (Figure 10).

The Data Stream Monitor Foreground and Background Mode Applications

The DSM will process and detect faults in any data stream as long as it is deterministic. There are examples of code and data streams that presently meet this criterion (in addition to the data stream derived from memory). One example is the hardware configuration software routines that are executed on each ignition cycle of the ECU/MCU system.
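A software model of the PSA idea may make the signaturing step concrete. The sketch below assumes a 16-bit LFSR using the CRC-16-CCITT polynomial; the paper does not state the actual register width or polynomial, so both are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed polynomial: CRC-16-CCITT, x^16 + x^12 + x^5 + 1.
   The real DSM hardware polynomial is not given in the paper. */
#define PSA_POLY 0x1021u

/* Fold one bus word into the running signature (one "clock" of the
   parallel-input LFSR, modeled bit-serially in software). */
static uint16_t psa_accumulate(uint16_t sig, uint16_t word)
{
    sig ^= word;
    for (int bit = 0; bit < 16; bit++)
        sig = (sig & 0x8000u) ? (uint16_t)((sig << 1) ^ PSA_POLY)
                              : (uint16_t)(sig << 1);
    return sig;
}

/* Background-mode pass: signature an entire memory block, then
   compare against the reference generated when the code was built. */
static int psa_check_block(const uint16_t *mem, size_t nwords,
                           uint16_t reference)
{
    uint16_t sig = 0;
    for (size_t i = 0; i < nwords; i++)
        sig = psa_accumulate(sig, mem[i]);
    return sig == reference;   /* nonzero means memory intact */
}
```

Because the signature is a polynomial-division remainder over the whole stream, any single flipped bit anywhere in the block changes the result, which is what gives the "virtually eliminated" aliasing probability cited above.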
To accomplish this task, the DSM is composed of two PSA circuit subsystems. One is dedicated to the background mode as described, and the other is dedicated to performing the signature operation when the CPU is driving the bus (foreground mode). The two circuit subsystems are joined by a common Mode Control Module (Figure 10). The device as described can ensure that the hardware configuration modules are processed by the CPU the same way each time they are executed. This is equivalent to a Dual Microcontroller system verifying that both micros have initialized the same way on each and every ignition cycle.

Common Mode Bus Errors

Since a Dual CPU system operates from a common memory, it is susceptible to common mode bus errors (from either bus or data transients). Further, the Dual CPU system depends on the fact that the two CPUs are in lockstep. Determining the health of the main CPU is predicated on the condition that both the main CPU and the secondary CPU react identically to the same information or data (independent of its quality). Even though the DSM monitors only the data bus, it does offer protection against this condition: if the address or control bus is corrupted, the result will manifest itself on the data bus as a corrupted signature.

SUMMARY

This article describes the evolution of a self-testing architecture. It examines the need for self-test of the CPU functions as well as the memory, surveys the various current implementations, and examines the ability of the Dual CPU / DSM to achieve these goals. Figure 10.

CONCLUSION

The Delphi Secured Microcontroller Architecture as presented is intended for use in stand-alone fail-silent systems. In the presence of a detected fault, the system output drivers will be disabled automatically, and fault information will be latched and stored; however, the CPU will stay active to enable online diagnostics. Current production applications default to a baseline mechanical backup system.
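The two-subsystem arrangement joined by the Mode Control Module can be sketched behaviorally as follows. The enum, struct, and 16-bit signature register with a CRC-16-CCITT polynomial are all illustrative assumptions; the actual hardware details are not given in the paper.

```c
#include <stdint.h>

/* Who is driving the bus on this cycle. */
typedef enum { DSM_BACKGROUND, DSM_FOREGROUND } dsm_mode_t;

typedef struct {
    uint16_t fg_signature;   /* CPU-driven bus traffic (foreground) */
    uint16_t bg_signature;   /* autonomous memory download (background) */
} dsm_t;

/* Assumed LFSR update (CRC-16-CCITT polynomial, illustrative only). */
static uint16_t psa_step(uint16_t sig, uint16_t word)
{
    sig ^= word;
    for (int i = 0; i < 16; i++)
        sig = (sig & 0x8000u) ? (uint16_t)((sig << 1) ^ 0x1021u)
                              : (uint16_t)(sig << 1);
    return sig;
}

/* Mode control: route each observed bus word into the PSA subsystem
   that owns the current bus cycle, keeping the two signatures
   independent of each other. */
static void dsm_observe(dsm_t *dsm, dsm_mode_t mode, uint16_t bus_word)
{
    if (mode == DSM_FOREGROUND)
        dsm->fg_signature = psa_step(dsm->fg_signature, bus_word);
    else
        dsm->bg_signature = psa_step(dsm->bg_signature, bus_word);
}
```

The key design point modeled here is isolation: foreground signaturing of, say, the ignition-cycle configuration routines never disturbs the background memory signature, so each can be compared against its own reference.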
The Delphi architecture is based on the premise that the closer a fault can be detected to its source, the faster and more accurate the ultimate response will be. The concurrent coverage offered by the Dual CPU and DSM, along with integrated redundant interface modules, provides the capability of deterministic fault coverage of the MCU / ECU system. At present, X-By-Wire systems are being proposed with multiple-host redundancy at a communication node (any application ECU is a host), or with complex distributed redundancy integrated with a master controller to manage the system. The ability of the ECU to accurately determine its own state of health using the Delphi Architecture will simplify system implementation, reduce communication complexity, and improve fault response and diagnostic latency in present Fault Tolerant or Fail Silent X-By-Wire systems.

ACKNOWLEDGMENTS

The author would like to thank his colleagues for their enduring patience, their gracious technical input and editorial support, and their encouragement in the preparation of this manuscript.
John Waidner - Chassis Systems Start Center, Sr. Systems Engineer
Troy Helm - Chassis Systems, Software Engineer
Rob A. Perisho Jr. - Advanced Chassis Systems, Sr. Project Engineer
Charles Duncan - Chassis Systems, Competency Leader
James Spall - Chassis Systems Start Center, Team Leader
Brian T. Murray, Ph.D. - Chassis Systems Research Ctr.

REFERENCES

1. R. A. Frowerk, "Signature analysis: a new digital field service method," Hewlett-Packard Journal, pp. 2-8, May 1977.
2. H. J. Nadig, "Signature analysis - Concepts, Examples, and Guidelines," Hewlett-Packard Journal, pp. 15-21, May 1977.
3. S. W. Golomb, Shift-Register Sequences, Holden-Day, Inc., San Francisco, 1967.
4. J. Sosnowski, "Concurrent Error Detection Using Signature Monitors," Proc. of Fault Tolerant Computing Systems, Methods, Applications, 4th International GI/ITG/GMA Conference, Sept. 1989, pp. 343-355.
5. K.
Wilken, J. P. Shen, "Continuous Signature Monitoring: Low-Cost Concurrent Detection of Processor Control Errors," IEEE Transactions on Computer-Aided Design, Vol. 9, pp. 629-641, June 1990.
6. Intel Corporation, Embedded Pentium(R) Processor Family Developer's Manual, "Error Detection," Chapter 22, pp. 393-399.
7. E. Bohl, T. Lindenkreuz, and R. Stephan, "The Fail-Stop Controller AE11," Proc. International Test Conference, IEEE Computer Society Press, Los Alamitos, Calif., 1997, pp. 567-577.
8. P. H. Bardell, W. H. McAnney, J. Savir, Built-In Test for VLSI: Pseudorandom Techniques, IBM Corp., John Wiley & Sons, 1987.
9. J. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications, Elsevier North-Holland, Inc., 1978, section 2.1.6, error correction (syndrome testing).
10. Zvi Kohavi, Switching and Finite Automata Theory, McGraw-Hill Inc., 1978, section 1.3, pp. 14-21.

ADDITIONAL SOURCES

1. J. Sosnowski, "Evaluation of transient hazards in microprocessor controllers," in Proc. of 18th IEEE FTCS, 1986, pp. 364-369.
2. P. K. Lala, Digital Circuit Testing and Testability, Academic Press Inc., 1997.
3. D. A. Anderson, G. Metze, "Design of Totally Self-Checking Circuits for m-Out-of-n Codes," IEEE Transactions on Computers, Vol. C-22, No. 3, March 1973.

CONTACT

Terry L. Fruehling, Chassis Systems Start Center, Sr. Systems Engineer
765-451-5431
terry.l.fruehling@delphiauto.com