In engineering, especially in data processing, fault tolerance (from the Latin tolerare, "to suffer", "to endure") is the property of a technical system to keep functioning even when unforeseen inputs or errors occur in its hardware or software.
Fault tolerance increases the reliability of a system, as is required, for example, in medical technology or aerospace technology. It is also a prerequisite for high availability, which plays an important role in telecommunications in particular.
Approaches at different levels
Fault tolerance can be achieved at different levels. Depending on the area of application (PC, medical technology, space technology, etc.), different approaches make sense, and combinations are often possible.
Fault tolerance in hardware
Hardware, i.e. an electronic circuit, can be made fault tolerant, for example, by adding "hot" redundancy. "Cold" redundancy, on the other hand, requires the intervention of another system (operator, software, etc.) and therefore does not, on its own, meet the fault tolerance requirement.
If, for example, two implementations of a circuit run in parallel (dual modular redundancy, DMR), a decision unit can detect an error by comparing the outputs of the two components, but it cannot correct the error. If a third instance of the component is added (triple modular redundancy, TMR), the decision unit can also correct an error. Once the defective unit has been marked as defective, a further error can still be detected (as with DMR). If TMR is required for the safe operation of a system, four or more redundant components are used, so that three working units remain after a single failure.
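The difference between DMR and TMR can be sketched in a few lines. This is a minimal illustration, assuming the redundant units deliver directly comparable outputs; the function names `dmr_compare` and `tmr_vote` are hypothetical and not taken from any library.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def dmr_compare(a: Hashable, b: Hashable) -> Tuple[bool, Optional[Hashable]]:
    """DMR: a mismatch reveals an error, but not which unit is wrong."""
    if a == b:
        return True, a
    return False, None


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Majority vote over redundant outputs.

    Returns the majority value and the indices of the units that disagree.
    If no strict majority exists, the value is None and all units are suspect.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


agree, value = dmr_compare(5, 7)          # error detected, no correction possible
voted, defective = tmr_vote([5, 5, 7])    # error corrected, unit 2 identified
```

With three units, the voter masks the single wrong output and names the unit that produced it. With two, it can only report that the outputs differ.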
Fault tolerance in software
At the software level, fault tolerance can be achieved through the following measures:
- Design diversity: different implementations of an algorithm run in parallel
- Data diversity: the input data are slightly modified and processed several times (useful, for example, against rounding errors)
- Temporal diversity: an algorithm is called multiple times with the same data (useful, for example, against short-term hardware errors)
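Two of the measures above, design diversity and temporal diversity, can be sketched as follows. This is an illustration only: the helper names, the three summation variants and the simulated transient fault are all assumptions made for the example, and data diversity is not shown.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence


def majority(results: Iterable[int]) -> int:
    """Return the value produced by a strict majority of the redundant runs."""
    results = list(results)
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


def design_diversity(variants: Sequence[Callable[[int], int]], x: int) -> int:
    """Run different implementations of the same algorithm and vote."""
    return majority(f(x) for f in variants)


def temporal_diversity(f: Callable[[int], int], x: int, runs: int = 3) -> int:
    """Run the same implementation several times on the same data and vote."""
    return majority(f(x) for _ in range(runs))


# Three independently written ways to sum 1..n; the third is deliberately wrong.
def sum_loop(n: int) -> int:
    return sum(range(1, n + 1))

def sum_formula(n: int) -> int:
    return n * (n + 1) // 2

def sum_buggy(n: int) -> int:
    return n * n // 2


def make_flaky_square() -> Callable[[int], int]:
    """Build a squaring function whose first call simulates a transient fault."""
    calls = {"n": 0}
    def square(x: int) -> int:
        calls["n"] += 1
        return -1 if calls["n"] == 1 else x * x
    return square


diverse_result = design_diversity([sum_loop, sum_formula, sum_buggy], 10)
repeated_result = temporal_diversity(make_flaky_square(), 4)
```

In both cases a single deviating result is outvoted. If every variant or every run failed in the same way, voting would not help, which is why the variants should be developed independently.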
Fault tolerance in user interfaces
Incorrect user input, i.e. human error, often causes abnormal operating conditions. Error tolerance is therefore one of the principles of dialogue design according to EN ISO 9241, Part 110. A dialogue is error-tolerant if the intended result can be achieved with no or only minimal correction effort by the user despite evidently incorrect entries. This includes:
- Support in discovering and avoiding input errors (plausibility check)
- No system crashes or undefined system states
- Error explanations that help the user correct the error
- Additional display output to help locate the error
- Automatic error correction with notification of the user
- Error handling that can be postponed
- Additional explanations on request
- Validation and confirmation before execution
- Error correction without changing the state of the dialogue
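A few of these points (plausibility check, no crash on bad input, error explanation, automatic correction with notification) can be illustrated with a small input check. This is a sketch under assumptions: the field (an age), the plausible range of 0 to 130, and the names `CheckResult` and `check_age` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    value: Optional[int]  # accepted value, or None if the entry was rejected
    message: str          # explanation for the user; empty if nothing to report


def check_age(raw: str) -> CheckResult:
    """Check a user-entered age without raising on bad input."""
    text = raw.strip()
    was_corrected = text != raw  # surrounding whitespace is fixed automatically
    try:
        age = int(text)
    except ValueError:
        return CheckResult(None, f"'{raw}' is not a whole number; please enter an age such as 42.")
    if not 0 <= age <= 130:
        return CheckResult(None, f"{age} is outside the plausible range 0 to 130.")
    note = "Surrounding spaces were removed." if was_corrected else ""
    return CheckResult(age, note)
```

A rejected entry yields an explanation instead of an exception, so the dialogue can stay in the same state while the user corrects the field.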
The potential errors that users cause or encounter can be classified as follows:
- Avoidable errors
These errors occur because too little attention was paid to user behavior. They could be avoided if the target group and its typical usage behavior were examined carefully. Typical avoidable user errors on websites are navigation errors and incorrect entries in forms. Extensive testing before a website or application is launched could prevent many of these errors.
- Known, unavoidable errors
Not all known errors can be avoided. Typing errors on the keyboard and accidentally submitting a form that has not yet been filled out completely are just two examples of errors that must be expected because they cannot be ruled out. There must therefore be simple, clearly recognizable ways to correct all foreseeable errors.
- Unforeseeable errors
This class includes all errors that result from unexpected user behavior or from programming errors that are difficult to identify. Most of the time, these errors lead to opaque program behavior that the user cannot understand. A typical case is an error caused by the use of non-normalized times. When the clocks change from daylight saving time to standard time, hour 2 is passed through twice, which means that supposedly unique time stamps may occur twice, or that time measurements appear to end earlier than they began. This behavior occurs only once a year and cannot be reproduced on the remaining days of the year. In such cases, the solution is usually to use UTC or normalized local time (local time without a daylight saving time shift).
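The time-stamp problem can be reproduced with Python's standard `datetime` module. The sketch below assumes Central European time with fixed offsets of UTC+2 (daylight saving time) and UTC+1 (standard time) and uses 29 October 2023 as an illustrative changeover date. The measurement itself is hypothetical.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # daylight saving time, UTC+2
CET = timezone(timedelta(hours=1), "CET")    # standard time, UTC+1

# A measurement that starts shortly before the clocks are set back
# and ends shortly after, i.e. in the second pass through hour 2.
start = datetime(2023, 10, 29, 2, 50, tzinfo=CEST)
end = datetime(2023, 10, 29, 2, 10, tzinfo=CET)

# What a program sees if it stores only the local wall-clock time.
naive_elapsed = end.replace(tzinfo=None) - start.replace(tzinfo=None)

# The same two instants normalized to UTC before subtracting.
utc_elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)

print(naive_elapsed.total_seconds() / 60)  # -40.0
print(utc_elapsed.total_seconds() / 60)    # 20.0
```

Stored as plain local time, the measurement appears to end 40 minutes before it began. Normalized to UTC, it correctly lasts 20 minutes.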
Levels of fault tolerance
As a rule, a distinction is made between the following levels of fault tolerance:
|Level|System behavior|
|go|The system reacts safely and correctly.|
|fail-operational|The system is fault tolerant with no degradation in performance.|
|fail-soft|System operation is safe, but performance is reduced.|
|fail-safe|Only system safety is guaranteed.|
|fail-unsafe|System behavior is unpredictable.|
Reaction and correction of errors
When reacting to or correcting errors, a distinction is made between two principles: forward and backward error correction.
Forward error correction
With forward error correction, the system tries to continue as if no error had occurred. It may, for example, compensate for incorrect input values by using empirical values from the past or input values from correctly functioning input interfaces, or it may switch immediately to correctly functioning substitute systems at the moment an error occurs. Errors usually remain invisible to users during forward error correction.
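The compensation of an incorrect input value can be sketched as follows. The example assumes a sensor reading with a redundant backup sensor and a short history of past values. The function names and the plausible range of -40 to 125 are hypothetical.

```python
from typing import Optional, Sequence


def plausible(value: Optional[float], low: float, high: float) -> bool:
    """A reading is usable if it exists and lies within the expected range."""
    return value is not None and low <= value <= high


def read_with_substitute(
    primary: Optional[float],
    backup: Optional[float],
    history: Sequence[float],
    low: float = -40.0,
    high: float = 125.0,
) -> float:
    """Continue with a usable value even if the primary input is faulty."""
    if plausible(primary, low, high):
        return primary
    if plausible(backup, low, high):
        return backup  # value from a correctly functioning input interface
    if history:
        return sum(history) / len(history)  # empirical value from the past
    raise RuntimeError("no usable value available")
```

The caller receives a value in every case where a substitute exists and does not need to know that the primary input failed.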
Backward error correction
With backward error correction, the system tries to return to a state prior to the occurrence of the error, for example to the state immediately before an incorrect calculation, in order to carry out this calculation again. A change of state to emergency operation or, for example, a restart of the system is also possible. If an incorrect calculation can be repeated successfully, the error remains invisible to the user even with backward error correction. Often, however, operation can only continue with reduced performance or limited functionality, and the error thus becomes visible.
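Returning to an earlier state and repeating a calculation can be sketched with a checkpoint. This is a minimal illustration, assuming the system state fits in a dictionary. The function `run_with_rollback`, the retry count and the `degraded` flag are hypothetical.

```python
import copy
from typing import Callable


def run_with_rollback(state: dict, step: Callable[[dict], None], retries: int = 2) -> bool:
    """Run a step; on failure restore the checkpoint and try again.

    Returns True if the step eventually succeeded. If all attempts fail,
    the state is left at the checkpoint and marked as degraded.
    """
    checkpoint = copy.deepcopy(state)  # state immediately before the calculation
    for _ in range(retries + 1):
        try:
            step(state)
            return True
        except Exception:
            # Undo any partial changes made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    state["degraded"] = True  # continue with limited functionality
    return False


attempts = {"n": 0}

def flaky_step(s: dict) -> None:
    s["total"] += 10              # partial update happens before the fault
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ValueError("transient fault")


state = {"total": 5}
succeeded = run_with_rollback(state, flaky_step)
```

Here the first attempt fails after changing the state, the checkpoint is restored, and the second attempt succeeds, so the error stays invisible. A step that fails on every attempt leaves the state unchanged apart from the `degraded` flag.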
In space technology, errors are corrected by system experts who evaluate satellite telemetry data in the ground station and switch functions using telecommands. With the enormous advances in on-board data processing (fast processors, large data memories, intelligent software concepts), data evaluation and switching for error correction are increasingly carried out autonomously by the satellite system itself.
Because a complex on-board system requires extensive verification, with the associated expenditure of time and money, increasing autonomy in error correction is introduced only in small steps. Unlike with systems on Earth, an incorrect error correction can lead to the complete loss of a satellite.