Fault tolerance

In engineering, especially in data processing, fault tolerance (from the Latin tolerare, "to suffer", "to endure") is the property of a technical system to maintain its functionality even when unforeseen inputs or errors occur in the hardware or software.

Fault tolerance increases the reliability of a system, as is required, for example, in medical technology or aerospace technology. Fault tolerance is also a prerequisite for high availability, which plays an important role in telecommunications in particular.

Approaches on different levels

Fault tolerance can be achieved at different levels. Depending on the area of application (PC, medical technology, space technology, etc.), different approaches make sense, and combinations are often possible.

Fault tolerance in hardware

Hardware, i.e. an electronic circuit, can be made fault tolerant by adding, for example, "hot" redundancy. "Cold" redundancy, by contrast, requires intervention by another system (an operator, software, etc.) and therefore does not, on its own, meet the fault tolerance requirement.

If, for example, two implementations of a circuit run in parallel (dual modular redundancy, DMR), a decision unit can detect an error by comparing the outputs of the two components, but it cannot correct the error. If a third instance of the component is added (triple modular redundancy, TMR), the decision unit can also correct an error. Once the faulty unit has been marked as defective, a further error can still be detected (as with DMR). If TMR is required for the safe operation of a system, four or more redundant components are used.
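The difference between detecting and correcting an error can be illustrated with a small sketch of the decision unit. The function names `dmr_compare` and `tmr_vote` are hypothetical and chosen only for this illustration; a real decision unit would be implemented in hardware, and the sketch assumes that the outputs of the redundant units can be compared for equality.

```python
from collections import Counter


def dmr_compare(a, b):
    """Dual modular redundancy: report whether the two outputs agree.

    A mismatch reveals that an error occurred, but not which unit is wrong.
    """
    return a == b


def tmr_vote(a, b, c):
    """Triple modular redundancy: return the majority output.

    A single faulty unit is outvoted by the other two. If all three
    outputs differ, no majority exists and the error cannot be corrected.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one unit is faulty")
    return value
```

With three units, `tmr_vote(7, 7, 9)` yields the majority value and masks the deviating unit, whereas `dmr_compare(7, 9)` can only signal that the two outputs disagree.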

Fault tolerance in software

At the software level, fault tolerance can be achieved through the following measures:

Design diversity: different implementations of an algorithm run in parallel
Data diversity: the input data are slightly modified and processed several times (helps, for example, against rounding errors)
Temporal diversity: an algorithm is called several times with the same data (helps, for example, against transient hardware errors)
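Two of these measures can be sketched in a few lines. The example below is only an illustration under simple assumptions: design diversity is shown with two independent integer square root implementations (a hand-written Newton iteration and the standard library's `math.isqrt`), and temporal diversity with repeated execution followed by a majority vote. The helper names are hypothetical.

```python
import math
from collections import Counter


def isqrt_newton(n):
    """Integer square root of a non-negative integer by Newton's iteration."""
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def design_diverse_isqrt(n):
    """Design diversity: run two different implementations and compare."""
    a = isqrt_newton(n)
    b = math.isqrt(n)
    if a != b:
        raise RuntimeError(f"implementations disagree: {a} != {b}")
    return a


def temporally_diverse(func, arg, runs=3):
    """Temporal diversity: call func several times and take the majority."""
    results = [func(arg) for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count <= runs // 2:
        raise RuntimeError("no majority among repeated runs")
    return value
```

In software that runs on fault-free hardware the repeated runs always agree; the vote only becomes meaningful when a transient fault can corrupt an individual run.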

Fault tolerance in user interfaces

Incorrect user input, i.e. human error, often causes abnormal operating conditions. Error tolerance is therefore one of the dialogue design principles of EN ISO 9241, Part 110 (dialogue principles). A dialogue is error-tolerant if, despite evident errors in the input, the intended result can be achieved with either no or only minimal corrective effort by the user. Measures include:

Support in detecting and avoiding input errors (plausibility check)

No system crashes or undefined system states

Error explanations that help the user correct the error

Additional display aids for locating errors

Automatic error correction, with notification of the user

Error handling that can be deferred

Additional explanations on request

Checking and confirmation before execution

Error correction without changing the state of the dialogue
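Some of the measures listed above (plausibility check, no crash on invalid input, error explanation, automatic correction with notification) can be sketched for a single input field. The function `parse_quantity`, the result type `InputResult`, the permitted range and the corrected typing errors are all assumptions made for this illustration and are not taken from the standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputResult:
    value: Optional[int]  # parsed value, or None if the user must correct the entry
    message: str          # explanation for the user ("" if there is nothing to report)


def parse_quantity(text: str, low: int = 1, high: int = 99) -> InputResult:
    """Parse a quantity field without ever raising an exception."""
    cleaned = text.strip()
    # Automatic correction of common typing errors (letter O for zero, l for one).
    corrected = cleaned.replace("O", "0").replace("o", "0").replace("l", "1")
    if not (corrected.isascii() and corrected.isdigit()):
        return InputResult(None, f"'{text}' is not a whole number; please enter digits only.")
    value = int(corrected)
    # Plausibility check against the permitted range.
    if not low <= value <= high:
        return InputResult(None, f"{value} is outside the allowed range {low} to {high}.")
    # Inform the user whenever the entry was corrected automatically.
    note = f"Interpreted '{cleaned}' as {value}." if corrected != cleaned else ""
    return InputResult(value, note)
```

The dialogue state is left untouched in every case: the caller receives either a usable value or an explanation and can decide whether to ask again.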

The potential errors that users cause or may encounter can be classified as follows:

Avoidable errors

Errors of this type arise because too little attention was paid to user behavior. They could be avoided if the target group and its typical usage behavior were examined carefully. Typical avoidable user errors on websites are navigation errors and incorrect entries in forms. Extensive testing before a website or application is launched could prevent many of these errors.

Known, unavoidable errors

Not all known errors can be avoided. Typing errors and accidentally submitting a form that has not yet been filled out completely are just two examples of errors that must be expected because they cannot be ruled out. Simple, clearly recognizable correction options must therefore exist for all foreseeable errors.

Unforeseeable errors

This class includes all errors that arise from unexpected user behavior or from programming errors that are difficult to identify. In most cases these errors lead to opaque program behavior that the user cannot understand. A typical case is an error caused by the use of non-normalized times. At the change from daylight saving time to standard time, the hour starting at 2:00 is run through twice, so that supposedly unique time stamps may occur twice, or time measurements appear to end earlier than they began. This behavior occurs only once a year and cannot be reproduced on the remaining days of the year. In such cases, the solution is usually to use UTC or normalized local time (local time without the daylight saving time shift).
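The effect can be reproduced with a short sketch. It assumes Central European time, where the clocks were set back from 03:00 to 02:00 on 29 October 2023, and uses fixed UTC offsets so that it does not depend on a time zone database; the variable names are chosen for this illustration only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for Central European Summer Time and Central European Time.
CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

# A measurement starts at 02:50 summer time and ends 20 minutes later,
# after the clocks have been set back, at 02:10 standard time.
start = datetime(2023, 10, 29, 2, 50, tzinfo=CEST)
end = datetime(2023, 10, 29, 2, 10, tzinfo=CET)

# Non-normalized local wall-clock times: the measurement seems to end
# 40 minutes before it began.
naive_elapsed = end.replace(tzinfo=None) - start.replace(tzinfo=None)

# Normalized to UTC, the elapsed time is the correct 20 minutes.
utc_elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
```

Storing and comparing the UTC values and converting to local time only for display avoids both the negative duration and duplicate time stamps.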

Levels of fault tolerance

As a rule, a distinction is made between the following levels of fault tolerance:

Level	System behavior
go	The system reacts safely and correctly.
fail-operational	The system is fault tolerant with no loss of performance.
fail-soft	System operation remains safe, but performance is reduced.
fail-safe	Only the safety of the system is guaranteed.
fail-unsafe	System behavior is unpredictable.

Reacting to and correcting errors

When reacting to or correcting errors, a distinction is made between two principles: forward and backward error correction.

Forward error correction

With forward error correction, the system tries to continue as if no error had occurred. It may, for example, compensate for incorrect input values by using empirical values from the past or input values from correctly functioning input interfaces, or it may switch immediately to correctly functioning substitute systems when an error occurs. Errors usually remain invisible to users during forward error correction.
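A minimal sketch of this principle for a sensor input follows. The class `ForwardCorrectingReader`, its parameters and the plausibility range are hypothetical and serve only to illustrate the idea: an implausible or failed reading is replaced by the value of a backup interface or, failing that, by the last value known to be good.

```python
from typing import Callable


class ForwardCorrectingReader:
    """Keeps delivering a usable value even when an input source fails."""

    def __init__(self, read_primary: Callable[[], float],
                 read_backup: Callable[[], float],
                 low: float, high: float, initial: float):
        self._sources = (read_primary, read_backup)
        self._low = low
        self._high = high
        self._last_good = initial  # empirical value from the past

    def read(self) -> float:
        for source in self._sources:
            try:
                value = source()
            except Exception:
                # For this sketch, any failure of a source means "try the next one".
                continue
            if self._low <= value <= self._high:
                self._last_good = value
                return value
        # Neither source delivered a plausible value: continue with the last good one.
        return self._last_good
```

The caller of `read()` never sees the failure, which matches the observation that forward error correction usually stays invisible to the user. A real system would additionally log the fault so that it can be repaired.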

Backward error correction

With backward error correction, when an error occurs the system tries to return to a state prior to the error, for example to the state immediately before an incorrect calculation, in order to carry out the calculation again. A transition to emergency operation or, for example, a restart of the system is also possible. If an incorrect calculation can be repeated successfully, the error remains invisible to the user even with backward error correction. Often, however, operation can only continue with reduced performance or limited functionality, and the error thus becomes visible.
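Returning to an earlier state and repeating the calculation can be sketched with a checkpoint and retry loop. The function `run_with_rollback` is hypothetical, and the sketch assumes that the whole system state fits into a dictionary that can be copied.

```python
import copy


def run_with_rollback(state: dict, step, max_attempts: int = 3) -> dict:
    """Run step(state); on failure restore the checkpoint and try again."""
    checkpoint = copy.deepcopy(state)  # state immediately before the calculation
    for _ in range(max_attempts):
        try:
            step(state)
            return state
        except Exception:
            # Roll back: discard any partial changes made by the failed step.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("step failed repeatedly; state restored to checkpoint")
```

If a retry succeeds, the caller only sees the correct result. If all attempts fail, the state is back at the checkpoint and the caller can switch to a degraded mode, in which case the error becomes visible.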

External correction

In space technology, errors are corrected by system experts who evaluate satellite telemetry data in the ground station and switch functions by means of telecommands. With the enormous advances in on-board data processing (fast processors, large data memories, intelligent software concepts), data evaluation and switching for error correction are increasingly carried out autonomously by the satellite system itself.

Because a complex on-board system requires extensive verification, with the associated expenditure of time and money, this increasing autonomy in error correction is being introduced only in small steps: unlike in systems on Earth, an incorrect error correction can lead to the complete loss of a satellite.


