Quorum (computer science)

A quorum, or voting disk, is a component of the cluster manager of a computer cluster that maintains data integrity in the event of a partial failure. If the cluster interconnect (the connection between the cluster nodes) fails, the overall system risks splitting into units that operate autonomously, which almost always threatens data integrity (the split-brain problem). When the interconnect is interrupted, alternating or concurrent writes to the logical structure of the voting disk are used to decide which part of the cluster should survive. The voting disk resides on shared storage.

For example, in Oracle RAC the part of the cluster that survives is:

with an asymmetrical division (e.g. 2:3 nodes), the larger part;
with an even division (e.g. 2:2 nodes), the part with the larger workload.
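
The two rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic as described here, not Oracle's implementation: the Partition type, the workload field (any comparable load metric) and the function choose_survivor are hypothetical names, and the behaviour when both node count and workload are equal is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One side of a split cluster (hypothetical model, not an Oracle API)."""
    name: str
    node_count: int
    workload: float  # assumed: any comparable load metric


def choose_survivor(a: Partition, b: Partition) -> Partition:
    """Pick the partition that stays up after the interconnect fails."""
    # Asymmetrical division: the larger part survives.
    if a.node_count != b.node_count:
        return a if a.node_count > b.node_count else b
    # Even division: the part with the larger workload survives.
    # Assumption: if the workloads are equal as well, the first argument wins.
    return a if a.workload >= b.workload else b


if __name__ == "__main__":
    left = Partition("left", node_count=2, workload=0.4)
    right = Partition("right", node_count=3, workload=0.1)
    print(choose_survivor(left, right).name)
```

In a real cluster, both sides cannot exchange these values directly once the interconnect is down, which is why the comparison has to be mediated by the voting disk.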

Once the interconnect has failed as a communication channel, such a distinction would be impossible without coordination through mass storage. Since almost all cluster managers react to an interconnect failure by restarting at least one node, persistent storage of the cluster status on the voting disk is a further advantage: a large part of the renegotiation of availability and master status becomes unnecessary. Without a persistent voting disk, these negotiations often require several reboots. Using a quorum therefore increases the availability of the individual nodes by eliminating these restart cycles.

Problems and solutions

Once it is in use, the voting disk is itself an integral part of the cluster. If a previously available quorum suddenly becomes unavailable during cluster operation, the entire system fails. This applies in particular when a single shared storage system fails. Clusterware manufacturers therefore strive to avoid the resulting single point of failure.

The common approach to solving this problem is to mirror the voting disk across multiple physical media. Mirroring, however, raises problems of its own:

The voting disks must be guaranteed to be consistent and must have the lowest possible latency.
A split-brain scenario in which the voting disks are distributed among the potentially autonomous sub-units would also be counterproductive. Oracle Clusterware, for example, resolves this conflict by using an odd number of voting disks.
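
Why an odd number helps can be shown with a small sketch. It assumes a strict-majority rule: a sub-unit may continue only if it can reach more than half of the voting disks. The function names has_disk_majority and sides_with_majority are hypothetical and serve only this illustration.

```python
def has_disk_majority(accessible: int, total: int) -> bool:
    """True if a sub-unit reaches strictly more than half of the voting disks."""
    if total <= 0 or not 0 <= accessible <= total:
        raise ValueError("accessible must lie between 0 and total, total > 0")
    return 2 * accessible > total


def sides_with_majority(left: int, right: int) -> list[str]:
    """Given how the voting disks are divided between two sub-units,
    return the names of the sides that hold a strict majority."""
    total = left + right
    sides = []
    if has_disk_majority(left, total):
        sides.append("left")
    if has_disk_majority(right, total):
        sides.append("right")
    return sides


if __name__ == "__main__":
    # Three disks divided 2:1: exactly one side may continue.
    print(sides_with_majority(2, 1))
    # Two disks divided 1:1: neither side holds a majority.
    print(sides_with_majority(1, 1))
```

With an odd total, the disks can never be divided evenly, so two sub-units can never both be left with exactly half; under the strict-majority assumption, at most one side continues.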

The most thorough solution to the consistency, latency and availability problems is storage-side replication in the storage area network (SAN), which can be very expensive. It transparently presents a single replicated device to all cluster members and thereby relieves the clusterware, the cluster members and the administrators of this task.

Literature

Chanda Ray: Distributed Database Systems. Pearson Education India, 2009, ISBN 978-81-317-2718-8, Chapter 10: Distributed Recovery Management.