Single point of failure

from Wikipedia, the free encyclopedia

A single point of failure (SPOF) is a component of a technical system whose failure by itself causes the entire system to fail.

In high-availability systems, every component must be designed redundantly. Diversity should also play a role: systems of different design (for example, from different manufacturers) are used for the same task. This makes it less likely that several systems fail simultaneously for a single reason.
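The effect of redundancy can be illustrated with simple availability arithmetic. The following is a minimal sketch that assumes the components fail independently of one another, which is exactly the condition that diversity tries to approximate. The function names and the example value of 0.99 are illustrative and not taken from the article.

```python
from math import prod


def series_availability(availabilities):
    """All components are required: the system is up only if every one is up."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant components: the system is down only if all of them are down.

    Assumes independent failures (no common cause).
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical availability of one component.
single = 0.99
print("chain of two:    ", series_availability([single, single]))
print("redundant pair:  ", parallel_availability([single, single]))
```

A chain of two required components is less available than either one alone, while a redundant pair is more available. If both redundant components share a common cause of failure, the independence assumption no longer holds and the computed gain is not achieved.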

Principle

Depending on the requirements, redundant devices must not be operated in the same location, as otherwise a SPOF still exists:

  • At Fukushima, several diesel-powered emergency generators were installed, which is sufficient protection in many damage scenarios. However, the tsunami destroyed a large part of the emergency generators. Since then, mobile emergency power units and units located on a hill safe from flooding have been introduced.
  • The same problem arises with data backup: if data is backed up to an external hard drive that is kept in the same office, this is certainly sufficient for the scenarios "laptop destroyed by spilled coffee" and "thief steals the laptop", but not when a fire destroys the entire office. To ensure redundancy in that case, the hard drive would have to be stored elsewhere, for example in a safe deposit box.
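The location problem in both examples can be checked mechanically. The sketch below assumes that each redundant copy is described by the set of things it depends on (building, flood zone, and so on); any dependency shared by all copies is a remaining SPOF. The function name and the data are hypothetical.

```python
def shared_dependencies(copies):
    """Return the dependencies that every redundant copy has in common.

    `copies` maps a copy's name to the set of locations or resources it
    depends on. A non-empty result means a single event can hit all copies.
    """
    dependency_sets = [set(deps) for deps in copies.values()]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical backup setups corresponding to the example above.
drive_in_office = {"laptop": {"office"}, "external_drive": {"office"}}
drive_in_bank = {"laptop": {"office"}, "external_drive": {"bank"}}

print(shared_dependencies(drive_in_office))  # the office is shared by both copies
print(shared_dependencies(drive_in_bank))    # nothing is shared
```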

In the IT sector

Simple steps to avoid SPOFs in IT operations include using multiple uninterruptible power supplies (UPS), making server components redundant (power supply units and network cards), and providing a sufficient number of end devices.

Connecting to several transformers of the energy supplier, cross-cabling (several current paths), one or more generators as an emergency power system, IT systems with multiple power supply units or several static transfer switches (STS), redundant air conditioning, and multiple connections from the end devices to the company's own network (corporate network) together yield an infrastructure that is largely protected against failures. The next step up in availability is achieved with internally highly redundant (fault-tolerant) servers or cluster systems. In addition, backup data centers can take over in the event of a disaster.

Example

In a company, the computer network is to be protected against power and server failures. Here, a SPOF is any single element whose failure affects the entire system.
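One way to find such elements is to model the infrastructure as a graph and test, for each element in turn, whether the service still works without it. The following is a minimal sketch under that assumption; the graph, the node names, and both function names are hypothetical, and an edge here simply means "feeds or serves".

```python
def reachable(graph, start, goal, removed):
    """Depth-first search from start to goal that ignores the removed node."""
    if removed in (start, goal):
        return False
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def single_points_of_failure(graph, source, target):
    """Return every intermediate node whose removal cuts source from target."""
    nodes = set(graph) | {n for neighbors in graph.values() for n in neighbors}
    candidates = nodes - {source, target}
    return sorted(n for n in candidates
                  if not reachable(graph, source, target, n))


# Hypothetical company setup: two UPS units, but one server and one switch.
infrastructure = {
    "grid": ["ups_a", "ups_b"],
    "ups_a": ["server"],
    "ups_b": ["server"],
    "server": ["switch"],
    "switch": ["clients"],
}
print(single_points_of_failure(infrastructure, "grid", "clients"))
# → ['server', 'switch']
```

In this example the redundant UPS units are not reported, because either one alone keeps the path intact, while the single server and the single switch are.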

Aviation

In aviation, avoiding single points of failure is of paramount importance. However, a single point of failure is permissible if its failure does not affect safety, or if safety analyses confirm that the failure occurs sufficiently rarely.

The FAA classifies on-board systems according to the severity of their possible malfunctions into the following categories:

  • Minor failure (may occur more than once per 100,000 operating hours; no impact on safety)
  • Major failure (must occur less than once per 100,000 operating hours; all occupants survive the incident)
  • Hazardous failure (must occur less than once per 10 million operating hours; demands a high level of flying skill; some occupants may die in the incident)
  • Catastrophic failure (must occur less than once per billion operating hours; the aircraft is irretrievably lost even with the best pilots; most occupants die in the incident)

If the failure of a system constitutes a major failure, a single, non-fail-safe design is permitted. In contrast, the failure of a single system must not result in a catastrophic failure.
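The rate limits in the list above can be written down as a small lookup table. This is only a sketch of the figures as stated in this article, not a rendering of the FAA's actual certification procedure; the names are illustrative.

```python
# Upper bounds on the failure rate per operating hour, as listed above.
# None means the list states no upper bound for that category.
MAX_FAILURES_PER_HOUR = {
    "minor": None,
    "major": 1e-5,         # less than once per 100,000 hours
    "hazardous": 1e-7,     # less than once per 10 million hours
    "catastrophic": 1e-9,  # less than once per billion hours
}


def meets_rate_requirement(category, failures_per_hour):
    """Check a predicted failure rate against the limit for its category."""
    limit = MAX_FAILURES_PER_HOUR[category]
    return limit is None or failures_per_hour < limit


# Hypothetical predicted rate of 1 failure per 1 million hours.
print(meets_rate_requirement("major", 1e-6))         # True
print(meets_rate_requirement("catastrophic", 1e-6))  # False
```

A component with this hypothetical rate would be acceptable where its failure is only a major failure, but far too unreliable where its failure alone would be catastrophic, which is why redundancy is required there.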

Measuring devices and avionics

A modern aircraft continuously processes dozens of different measured values: altitude, speed, position data from the inertial navigation system, engine data, signals received from the instrument landing system, and many more.

Depending on the requirements, this data must be not only collected but also processed in a fail-safe manner. Modern aircraft use three independent flight control computers, which obtain the raw data from three independent sources (pitot tubes, static probes, ...).

Assuming that a simultaneous malfunction of two channels is extremely unlikely, a triple system can identify the correct reading and discard the wrong one. If only two channels are still active, the system can at least alert the pilots to a questionable measured value, but it can no longer decide which of the two values is correct.
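The difference between three and two active channels can be sketched with a simple voter. This is an illustrative model only, not the logic of any real flight control computer; the function name, the status strings, and the tolerance value are assumptions.

```python
from statistics import median


def vote(readings, tolerance):
    """Select a value from two or three redundant readings.

    Returns (value, status). With three readings, the median is accepted if
    at least one other reading agrees with it within `tolerance`, so a single
    wrong value is outvoted. With two readings, a disagreement can only be
    reported, not resolved.
    """
    if len(readings) == 3:
        mid = median(readings)
        agreeing = [r for r in readings if abs(r - mid) <= tolerance]
        if len(agreeing) >= 2:
            return mid, "ok"
        return None, "disagree"
    if len(readings) == 2:
        a, b = readings
        if abs(a - b) <= tolerance:
            return (a + b) / 2, "ok"
        return None, "disagree"
    raise ValueError("expected two or three readings")


# Hypothetical airspeed readings in knots; the third sensor is faulty.
print(vote([250.0, 251.0, 310.0], tolerance=5.0))  # → (251.0, 'ok')
# Only two channels left: the fault is detected but cannot be attributed.
print(vote([250.0, 310.0], tolerance=5.0))         # → (None, 'disagree')
```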

Control

For safety reasons, the control systems of a commercial aircraft were duplicated or even triplicated. At the same time, a purely mechanical design using levers, rods or cables was no longer practicable for the more modern wide-body aircraft such as the Boeing 747, Lockheed TriStar or Douglas DC-10. Mechanical linkages were replaced by hydraulic and later electrical/electronic systems (so-called fly-by-wire) that transmit the control commands to the control-surface actuators. With electrical signal transmission, the necessary redundancy became much easier to ensure.

With hydraulic signal transmission, it nevertheless happened in individual cases that all redundant systems were damaged by the same incident and failed, because they were routed close to one another. On United Airlines Flight 232, for example, fragments from a disintegrating engine destroyed all three hydraulic systems of a DC-10. On Japan Air Lines Flight 123, all four hydraulic systems of a Boeing 747 were destroyed after a sudden decompression of the pressurized cabin.

The EBHA (electric backup hydraulic actuator) on board the Airbus A380 and Gulfstream G650 represents a further improvement. Conventional hydraulic actuators are controlled electrically/electronically, but a failure of the hydraulic lines would still disable them. An EBHA, by contrast, has its own self-contained hydraulic system. EBHAs make it possible to dispense with one of the three hydraulic systems and thus save weight.

References

  1. Tadashi Narabayashi: Countermeasures derived from the lessons of the Fukushima Daiichi nuclear power plant accident. In: Proceedings of the 2013 21st International Conference on Nuclear Engineering. Retrieved July 17, 2019.
  2. AC 25.1309-1A, "Systems Design and Analysis", FAA; see AC 25.1309-1.
  3. Eyewitness Report: United Flight 232 (archived April 18, 2001, in the Internet Archive).
  4. https://australianaviation.com.au/2011/01/g650-flies-with-electric-backup-hydraulic-actuators/