Fault-Tolerance Techniques for Spacecraft Control Computers

by: Mengfei Yang, Gengxin Hua, Yanjun Feng, Jian Gong

Wiley, 2017

ISBN: 9781119107415, 344 pages


1
Introduction


A control computer is a key piece of equipment in a spacecraft control system, and its reliability is critical to the operation of the spacecraft: the success of a space mission hinges on failure‐free operation of the control computer. During a mission, a spacecraft’s long‐term operation in a hostile space environment, without the possibility of maintenance, demands a highly reliable control computer, which therefore usually employs multiple fault‐tolerance techniques from the design phase onward. Focusing on the spacecraft control computer’s characteristics and reliability requirements, this chapter provides an overview of fundamental fault‐tolerance concepts and principles, analyzes the space environment, emphasizes the importance of fault‐tolerance techniques in the spacecraft control computer, and summarizes the current status and future development directions of fault‐tolerance technology.

1.1 Fundamental Concepts and Principles of Fault‐tolerance Techniques


Fault‐tolerance technology is an important means of guaranteeing the dependability of a spacecraft control computer: it improves system reliability through the implementation of multiple forms of redundancy. This section briefly introduces its fundamental concepts and principles.

1.1.1 Fundamental Concepts


“Fault‐tolerance” refers to “a system’s ability to function properly in the event of one or more component faults,” which means the failure of a component or a subsystem should not result in failure of the system. The essential idea is to achieve a highly reliable system using components that may have only standard reliability [1]. A fault‐tolerant computer system is defined as a system that is designed to continue fulfilling its assigned tasks even in the event of hardware faults and/or software errors. The techniques used to design and analyze fault‐tolerant computer systems are called fault‐tolerance techniques. The body of theory and research related to fault‐tolerant computer techniques is termed fault‐tolerant computing [2–4].

System reliability assurance depends on the implementation of fault‐tolerance technology. Before the discussion of fault‐tolerance, it is necessary to clarify the following concepts [4,5]:

  1. Fault: a physical defect in hardware, an imperfection in design or manufacturing, or a bug in software.
  2. Error: information inaccuracy or an incorrect state resulting from a fault.
  3. Failure: a system’s inability to provide the intended service.

A fault can be either explicit or implicit. An error is a consequence and manifestation of a fault. A failure is a system’s inability to perform its function. A system error may or may not result in system failure – that is, a system with a fault or error may still be able to complete its inherent function, and this possibility is the foundation of fault‐tolerance theory. Because the boundaries between them are not sharply defined, concepts 1, 2, and 3 above are often collectively referred to as “faults” (failures).

Faults can be divided into five categories on the basis of their pattern of manifestation, as shown in Figure 1.1.

Figure 1.1 Fault categorization.

A “permanent fault” is a component failure that persists: once it occurs, the component remains failed. A “transient fault” causes a component to fail momentarily, after which the component returns to normal operation. An “intermittent fault” is a recurring failure: the component alternates between faulty and fault‐free behavior, operating properly at some times and failing at others. A “benign fault” merely causes the failed component to stop working, and is relatively easy to handle. A “malicious fault” causes the failed component to appear to operate normally, or to transmit inconsistent values to different receivers as a result of the malfunction – hence, it is far harder to contain.

Currently, the following three fault‐tolerant strategies are utilized [4–6]:

  1. Fault masking. This strategy uses redundancy in the design to keep faults from affecting the system, so that faults are transparent to it and have no influence. It is mainly applied in systems that require high reliability and real‐time performance. The major methods include memory error‐correcting codes and majority voting. This type of method is also called static redundancy.
  2. Reconfiguration. This strategy restores system operation by removing the fault. It includes the following steps:
    • Fault detection – determining that a fault has occurred, which is a necessary precondition for system recovery;
    • Fault location – determining the position of the fault;
    • Fault isolation – isolating the fault to prevent its propagation to other parts of the system;
    • Fault recovery – restoring system operation through reconfiguration.

      This method is also known as dynamic redundancy.

  3. Integration of fault masking and reconfiguration. This combination of static and dynamic redundancy realizes system fault‐tolerance and is also called hybrid redundancy.

In addition to strategies 1, 2, and 3 above, analysis shows that, in certain scenarios, fault‐tolerance can also be achieved through degraded redundancy. Since degraded redundancy sacrifices part of the system’s function, this book does not discuss it further.

The key to fault‐tolerance is redundancy – no redundancy, no fault‐tolerance. Computer system fault‐tolerance rests on two kinds of redundancy: time redundancy and space redundancy. In time redundancy, the computation or transmission of data is repeated, and the result is compared with a stored copy of the previous result. In space redundancy, additional resources, such as components, functions, or data items, are provided beyond what fault‐free operation requires.

Implementing redundancy requires additional resources. The redundancies in the two categories above are conventionally divided into four types: hardware redundancy, software redundancy, information redundancy, and time redundancy (the first three being forms of space redundancy). In general, hardware faults are handled with hardware redundancy, information redundancy, and time redundancy, while software faults are handled with software redundancy and time redundancy.

  1. Hardware redundancy: In this type of redundancy, the effect of a fault is masked or detected through extra hardware resources – for example, using two CPUs to perform the same function. With two CPUs, the failure of one can be detected by comparing the two results; with three CPUs, the failure of one can be masked by majority voting, a typical static‐redundancy arrangement. A dynamic fault‐tolerant system can also be built from multiple hardware replicas, with backup components replacing those that fail; hybrid redundancy combines the static and dynamic approaches. Hardware redundancy, which ranges from simple backup to complex fault‐tolerance structures, is the most widely used and most basic redundancy method, and it underlies the other three types, since all of them ultimately require extra hardware resources.
  2. Software redundancy: In this type of redundancy, faults are detected and tolerated by means of extra software. On the rationale that different people are unlikely to make the same mistake, independent teams develop different versions of the same software, so that a given input does not induce the same error in every version.
  3. Information redundancy: This type of redundancy achieves fault‐tolerance through extra information; error‐correcting codes are the typical example. Information redundancy needs the support of hardware redundancy to carry out error detection and correction.
  4. Time redundancy: In this type of redundancy, fault detection and fault‐tolerance are achieved by spending extra time – for example, by repeatedly executing a program on the same hardware and comparing the results, or by taking a two‐out‐of‐three vote over repeated runs of an important program.

Because of the extra resources involved, redundancy inevitably affects system performance, size, weight, functionality, and reliability. In the design phase of a computer system with high‐reliability requirements, all application requirements must be balanced in order to select the appropriate redundancy method and fault‐tolerance structure. To reflect all aspects of the implementation and study of fault‐tolerant computer systems, this book covers system architecture, fault detection, buses, software, FPGAs, and fault injection, and introduces intelligent fault‐tolerance technology.

1.1.2 Reliability Principles


1.1.2.1 Reliability Metrics

Qualitative and quantitative analysis and estimation are essential in the design of fault‐tolerant computer systems. The major features involved are reliability, availability, maintainability, safety, performability, and testability, with each feature having its own qualitative and quantitative specifications [4,5,7].

  1. Reliability and its measurement (R(t))

    Reliability is the ability of a system to perform its required function under stated conditions for a stated period of time. Assume that the system is operating normally at t0. The conditional probability that...