Split Brain (Computer Science)

from Wikipedia, the free encyclopedia

In computer science, split brain is an undesirable state of a computer cluster in which all interconnections between the cluster parts are interrupted at the same time.

Forms

A basic distinction is made between:

  • the separation of a single node; an extreme example of this is the division of a 2-node cluster
  • the separation of a multi-node cluster (> 2) into two unequal parts
  • the separation of a multi-node cluster (> 2) into two equal parts
  • the separation of a multi-node cluster (> 2) into more than two parts. However, this last situation is viewed as several individual split-brain scenarios.
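The four forms above can be told apart from the node counts of the resulting parts alone. The following is a minimal sketch under that assumption; the function name `classify_split` and the returned labels are hypothetical and chosen only for illustration. A part of size 1 is treated as the single-node case, so a divided 2-node cluster falls into that category, as in the list above.

```python
def classify_split(part_sizes):
    """Classify a cluster split by the node counts of its isolated parts.

    `part_sizes` is a list such as [3, 2]: one entry per isolated part.
    The labels are illustrative, not taken from any cluster product.
    """
    parts = sorted(part_sizes, reverse=True)
    if len(parts) < 2:
        return "no split"
    if len(parts) > 2:
        # Viewed as several individual split-brain scenarios.
        return "more than two parts"
    if parts[1] == 1:
        # Includes the extreme case of a divided 2-node cluster: [1, 1].
        return "single node separated"
    if parts[0] == parts[1]:
        return "two equal parts"
    return "two unequal parts"


for sizes in ([4, 1], [1, 1], [3, 2], [2, 2], [2, 2, 1]):
    print(sizes, "->", classify_split(sizes))
```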

Occurrence

Left: The complete, working cluster; Right: A split brain situation in which node E has failed.

Depending on the technology used, a cluster interconnect or a quorum is usually used to coordinate the transactions in the cluster. If the connection between one or more parts of the cluster is interrupted, no part can tell whether it is facing a partial failure or a separation. Each of the now isolated cluster fragments continues to work on its own in order to keep providing the service. Since the connection to the public network (i.e. towards the user) still works normally, the following problems arise.

Effects

The basic problem of split brain is that at least two parts are still working, but coordination between them is no longer possible. While this does not appear to be an immediate problem for pure read access, write access leads to massive conflicts: the write operations are distributed over the parts of the cluster (which function but are isolated from each other), yet neither the logic layer (middle tier) nor the user notices anything unusual. From the user's point of view, the cluster behaves as in normal operation. However, because the interconnect is interrupted, a block written by node or part A cannot be read by node or part B, and vice versa.

The data states therefore diverge, and the consistency of the data is no longer guaranteed. Restoring all data after such a situation is normally not feasible in a reasonable amount of time, or is impossible altogether.
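The divergence can be illustrated with a small simulation. This is only a sketch: the `Fragment` class, the block names and the helper `conflicting_blocks` are hypothetical and do not model any real cluster file system. Two isolated fragments start from the same data, each accepts writes on its own, and afterwards the blocks on which they disagree are listed.

```python
class Fragment:
    """An isolated cluster fragment that keeps serving writes on its own."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = dict(blocks)  # private copy: no coordination with peers

    def write(self, block_id, value):
        self.blocks[block_id] = value


def conflicting_blocks(a, b):
    """Return the sorted ids of blocks whose contents differ between a and b."""
    keys = a.blocks.keys() | b.blocks.keys()
    return sorted(k for k in keys if a.blocks.get(k) != b.blocks.get(k))


# Common state before the interconnect is lost.
shared = {"block-1": "v0", "block-2": "v0"}
part_a = Fragment("A", shared)
part_b = Fragment("B", shared)

# Users keep writing; each request lands on whichever fragment serves it.
part_a.write("block-1", "written by A")
part_b.write("block-1", "written by B")
part_b.write("block-3", "written by B")

print(conflicting_blocks(part_a, part_b))  # ['block-1', 'block-3']
```

Neither fragment has any way of knowing which version of block-1 is the correct one, which is why reconciliation after the fact is so costly.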

Countermeasures

The basis of all countermeasures is the simultaneous use of a quorum and a cluster interconnect: if only one of the two coordination paths is lost, it is still possible to distinguish between a separation and a partial failure.
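Why two independent coordination paths help can be shown as a simple decision table. The sketch below is an illustration under simplifying assumptions, not the logic of any particular cluster product; the function `diagnose_peer` and its return labels are hypothetical. It assumes a node can check separately whether a peer answers over the interconnect and whether the peer still updates its entry in the quorum.

```python
def diagnose_peer(interconnect_ok, quorum_ok):
    """Interpret what a node observes about a peer over two separate paths.

    interconnect_ok: the peer answers over the cluster interconnect
    quorum_ok:       the peer still updates its entry in the quorum
    """
    if interconnect_ok and quorum_ok:
        return "healthy"
    if quorum_ok:
        # Peer is alive (quorum) but unreachable (interconnect): a separation.
        return "separation"
    if interconnect_ok:
        # Peer is reachable but has lost its path to the quorum.
        return "partial failure: quorum path"
    # Both paths are silent: failure and separation cannot be told apart.
    return "indistinguishable"


for ic in (True, False):
    for q in (True, False):
        print(f"interconnect={ic!s:5} quorum={q!s:5} -> {diagnose_peer(ic, q)}")
```

The last case, where both paths fail at once, is the parallel failure that additional quorums and bonded interconnects are meant to cover.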

Covering parallel failures (the simultaneous loss of several critical parts) increases the effort enormously. For split-brain prevention, for example, the use of several quorums and of parallelized or bonded interconnects guards against the failure of interconnect and storage.

The interplay between quorums and interconnect requires reliable, automated decision-making. With Oracle Clusterware, for example, the decision is made as follows:

After the interconnect is lost, the following survives (the criteria are applied in order):

  1. the part / node with the view of most of the quorums
  2. the part / node with the highest workload.

To avoid repeating the problem that several quorums were supposed to solve (I see two quorums, you see two quorums, but we see two different pairs!), Oracle uses an odd number of these quorums. All nodes that meet in a quorum must also see each other on the interconnect. If this is not the case, the load and topology information on the voting disk decides whether a node is alive or dead. The decision list above is extended accordingly:

  1. the part / node with the view of most of the quorums
  2. the part with the most nodes
  3. the part / node with the highest workload.
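An ordered list of criteria like this amounts to a lexicographic comparison: a later criterion only matters when all earlier ones are tied. The following sketch expresses that idea with a tuple as sort key. It is not Oracle Clusterware's implementation; the `Part` class, its field names and the example values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """One isolated part of the cluster (field names are hypothetical)."""
    name: str
    quorums_seen: int   # criterion 1: how many quorums this part can see
    node_count: int     # criterion 2: how many nodes this part contains
    workload: float     # criterion 3: current workload of this part


def surviving_part(parts):
    """Pick the survivor by comparing the criteria in the listed order."""
    return max(parts, key=lambda p: (p.quorums_seen, p.node_count, p.workload))


parts = [
    Part("A", quorums_seen=2, node_count=2, workload=0.9),
    Part("B", quorums_seen=2, node_count=3, workload=0.1),
    Part("C", quorums_seen=1, node_count=4, workload=0.8),
]
print(surviving_part(parts).name)  # B
```

Part C has the most nodes but sees fewer quorums, so it is ruled out by the first criterion; A and B tie on quorums, and B wins on node count even though its workload is lower.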

A cluster node can be given a higher weighting (vote) in the quorum so that it always survives; the surviving node then forces the other node or nodes to give up (reboot), see STONITH. For greater safety, a witness is sometimes introduced, so that at least two sources are always available or a weighting can take place.
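Both ideas, weighted votes and a witness, can be reduced to the same majority test. The sketch below assumes a simple rule, namely that a part survives only if it holds strictly more than half of all configured votes; the function `has_majority`, the node names and the vote tables are hypothetical and not taken from any specific product.

```python
def has_majority(members, votes):
    """Return True if `members` together hold more than half of all votes."""
    total = sum(votes.values())
    visible = sum(votes[m] for m in members)
    return 2 * visible > total


# Plain 2-node cluster: after a split, neither side has a majority.
plain = {"node1": 1, "node2": 1}
print(has_majority({"node1"}, plain), has_majority({"node2"}, plain))

# With a witness: the node that still reaches the witness survives.
with_witness = {"node1": 1, "node2": 1, "witness": 1}
print(has_majority({"node1", "witness"}, with_witness))
print(has_majority({"node2"}, with_witness))

# Weighted votes: node1 carries two votes and therefore always survives.
weighted = {"node1": 2, "node2": 1}
print(has_majority({"node1"}, weighted), has_majority({"node2"}, weighted))
```

In either variant at most one part can hold a majority, so the losing part is the one that would be forced to give up.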

References

  1. Split Brain - Linux-HA. Retrieved January 7, 2020.
  2. STONITH - Linux-HA. Retrieved January 7, 2020.