HACMP

HACMP (High Availability Cluster Multi-Processing) is the cluster manager for AIX. It is used for applications that require high availability, usually business-critical ones (e.g. the accounting system for securities transactions at a bank).

With version 6.1, HACMP was renamed PowerHA. Although the software no longer carries the old name, experts still commonly use the designation HACMP, even for newer versions.

Version 7.1 introduced so-called SmartAssists, which are intended to detect various applications automatically and configure them as an HA solution.

Functionality

The machines participating in an HACMP cluster are called nodes. So-called resource groups (RGs) run on these nodes and are the central concept in HACMP. An RG is the logical grouping of:

one or more file systems
one or more IP addresses
one or more processes and associated start / stop scripts

When such a resource group is activated on a cluster node, the associated file systems are mounted first; then the RG's processes are started using the start/stop scripts stored in the RG definition. Finally, the IP address (the so-called service IP) is configured as an IP alias on a specific interface.

If the resource group is moved to another cluster node (takeover), the application is first terminated using the stop script, the file systems are unmounted, and the IP alias with the service IP is removed. The next node in the sequence then runs through the activation process described above. For the client there is only a short interruption (the time needed for the switch) until the service is available again under the same IP address. The client does not notice that this IP address now belongs to a different machine.
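
The activation and takeover sequence can be illustrated with a small Python model. This is only a sketch under assumptions: the names ResourceGroup, activate, release and takeover, as well as the node names, the file system and the IP address, are hypothetical and not part of HACMP. The model only records the order of the steps; it does not mount anything or touch any interface.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    """Hypothetical model of a resource group: file systems, a service IP, an application."""
    name: str
    filesystems: list
    service_ip: str
    log: list = field(default_factory=list)

    def activate(self, node: str) -> None:
        # Order follows the text: mount, start the application, then add the IP alias.
        for fs in self.filesystems:
            self.log.append(f"{node}: mount {fs}")
        self.log.append(f"{node}: run start script")
        self.log.append(f"{node}: add IP alias {self.service_ip}")

    def release(self, node: str) -> None:
        # Order follows the text: stop the application, unmount, then remove the IP alias.
        self.log.append(f"{node}: run stop script")
        for fs in self.filesystems:
            self.log.append(f"{node}: unmount {fs}")
        self.log.append(f"{node}: remove IP alias {self.service_ip}")


def takeover(rg: ResourceGroup, old_node: str, new_node: str) -> None:
    """Move the resource group: release it on the old node, then activate it on the new one."""
    rg.release(old_node)
    rg.activate(new_node)


# Example values are made up for illustration.
rg = ResourceGroup("rg_app", ["/data"], "10.0.0.10")
rg.activate("node_a")
takeover(rg, "node_a", "node_b")
for line in rg.log:
    print(line)
```

Because the service IP travels with the resource group, the last logged step on the new node is the one that makes the service reachable again under the unchanged address.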

Most functions in HACMP/PowerHA are implemented as scripts (in the Korn shell); only a small kernel patch (the so-called dead-man switch, DMS) directly modifies the underlying operating system. This open architecture makes HACMP very flexible.

The biggest problem that cluster software has to solve is the so-called split-brain condition: both nodes believe that they are, or must become, the active one. To prevent this, various communication paths are defined when an HACMP/PowerHA cluster is configured, over which the cluster nodes send each other messages about their health. This is called a heartbeat and can be carried over:

specially configured IP interfaces
hdisk devices that both nodes must be able to access (disk heartbeat, or "target mode disk" in older versions)
serial lines (the classic method, and essential up to HACMP 4.4)

If a node concludes from missing heartbeats that it can no longer communicate with its partner or the outside world, the dead-man switch is triggered and the node either shuts itself down or restarts, depending on the configuration. Before shutting down, the active node also checks whether communication with the clients is still possible, so that the standby node can take over.
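
The heartbeat logic can be sketched in Python as follows. This is a heavily simplified illustration, not the actual HACMP mechanism: the class HeartbeatMonitor, the function dead_man_switch, the path names and the 10-second timeout are all assumptions made for the example. The sketch treats a node as isolated only when every configured heartbeat path has been silent for longer than the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat seen on each configured path."""
    timeout: float  # seconds of silence after which a path counts as dead (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record(self, path: str, now: float) -> None:
        # Remember when the most recent heartbeat arrived on this path.
        self.last_seen[path] = now

    def alive_paths(self, now: float) -> list:
        # A path is alive if its last heartbeat is no older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t <= self.timeout)

    def is_isolated(self, now: float) -> bool:
        # The node considers itself isolated only when every path has gone silent.
        return not self.alive_paths(now)


def dead_man_switch(monitor: HeartbeatMonitor, now: float, action: str = "halt") -> str:
    """Return the configured action ('halt' or 'reboot') if isolated, else 'continue'."""
    return action if monitor.is_isolated(now) else "continue"


mon = HeartbeatMonitor(timeout=10.0)
for path in ("ip", "disk", "serial"):
    mon.record(path, now=0.0)
mon.record("disk", now=8.0)  # only the disk heartbeat is still arriving

print(dead_man_switch(mon, now=15.0))                   # disk path still alive
print(dead_man_switch(mon, now=30.0, action="reboot"))  # all paths silent
```

Using several independent paths means that the loss of a single network or cable does not by itself make a node conclude that its partner has failed.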

Typical configurations

A large number of cluster configurations are possible with HACMP/PowerHA. By far the most common are active/passive clusters (called rotating clusters in HACMP jargon) and active/active clusters (cascading clusters).

Rotating cluster

The resource group runs on one of usually two nodes (more are possible if required), while only the operating system and the cluster manager run on the other node. If the active node fails, the other performs a takeover. The mode is called rotating because the resource group moves back and forth between the nodes, so it "rotates".

This operating mode is preferred for systems that are absolutely critical. Its advantage is that it is easy to plan and has relatively little complexity. Its disadvantage is that a significant portion of the capacity (the standby node or nodes) is idle most of the time.

Cascading cluster

The resource group with the main application runs on one node, while resource groups that can be shut down if necessary run on another node. In the event of a failure, the standby node first executes the stop scripts of its own resource groups and then takes over the RG of the main application.
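
The order of operations on the standby node can be written down as a short Python sketch. The function name cascading_takeover and the resource group names are hypothetical; the function only returns the planned steps and performs no real actions.

```python
def cascading_takeover(standby_groups: list, production_group: str) -> list:
    """Return the ordered steps a standby node would perform (hypothetical sketch)."""
    # First stop every resource group that currently runs on the standby node ...
    steps = [f"stop {rg}" for rg in standby_groups]
    # ... and only then take over the resource group of the main application.
    steps.append(f"take over {production_group}")
    return steps


# Example names are made up for illustration.
steps = cascading_takeover(["rg_test", "rg_dev"], "rg_prod")
for step in steps:
    print(step)
```

Stopping the less important groups first frees the standby node's capacity before the main application starts there.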

This operating mode is typical for systems in which a production instance is paired with one or more test or development instances, for example with SAP ERP or larger databases. The test instances run on the standby node as long as no failure occurs. In the event of a failure, the standby node takes over the production instance, and the test instances are unavailable for as long as it does so.
