A computer cluster, usually simply called a cluster (English for “swarm”, “group” or “heap”), is a number of networked computers. The term covers two different goals: increasing computing capacity (HPC cluster, high-performance computing) and increasing availability (HA cluster, high availability). The computers in a cluster (also called nodes or servers) are often referred to collectively as a server farm.
The term cluster primarily describes the architecture of the individual building blocks and how they interact. A fundamental distinction is made between hardware clusters and software clusters. The simplest form of a hardware cluster is known as active/passive; other variants are known as cascading. With these forms, an interruption of the service during switchover must be expected. HP OpenVMS clusters are able to implement active/active functionality at the hardware level.
Software clusters or application clusters, on the other hand, are better able to provide continuous operation (example: DNS servers). However, in a client/server architecture it depends on the client whether it can handle the switchover of the service.
A distinction is made between homogeneous and heterogeneous clusters. The computers in a homogeneous cluster run the same operating system on the same hardware; in a heterogeneous cluster, different operating systems or hardware can be used. Well-known cluster software for Linux includes, for example, HP Serviceguard, Beowulf and openMosix.
High availability cluster
High-availability clusters (HA clusters) are used to increase availability and improve reliability. If an error occurs on one node of the cluster, the services running on that node are migrated to another node. Most HA clusters have two nodes. In some clusters, services run continuously on all nodes; these clusters are called active-active or symmetrical. If not all nodes are active, the cluster is called active-passive or asymmetrical. Both the hardware and the software of an HA cluster must be free of single points of failure (components whose failure would bring down the entire system). Such HA clusters are used in critical environments in which only a few minutes of downtime per year are permitted. Critical computer systems must also be protected against disaster scenarios. For this purpose, the cluster nodes are often placed several kilometers apart in different data centers. In the event of a disaster, the node in the unaffected data center can take over the entire load. This type of cluster is also called a “stretched cluster”.
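The phrase “a few minutes of downtime per year” can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the article: the function names are hypothetical, and the redundancy formula assumes that node failures are independent and that failover is instantaneous and always succeeds, which real clusters only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Permitted downtime in minutes per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(node_availability: float, nodes: int) -> float:
    """Availability of a service that stays up while at least one node is up.

    Idealized assumption: nodes fail independently and failover is perfect.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    for label, value in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(value):.1f} minutes of downtime per year")
    # Two nodes that are each available 99% of the time, under the idealized model:
    print(f"two 99% nodes: {redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, an availability of 99.999% corresponds to roughly five minutes of downtime per year, which is the order of magnitude mentioned above.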
Load balancing cluster
Load balancing clusters are set up to distribute the load over several machines. The load is usually distributed by a central instance that is itself redundant. Typical areas of application are environments with high demands on computing performance. Here, the performance requirement is met not by upgrading individual computers but by adding further computers. One important reason for this approach is that inexpensive standard computers (COTS components) can be used instead of expensive special-purpose machines.
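As an illustration of how a central instance might spread requests over nodes, here is a minimal round-robin sketch. Round robin is only one possible distribution strategy and is an assumption of this example; the class and node names are hypothetical. Adding a node simply extends the rotation, which mirrors the idea of scaling by adding computers.

```python
from collections import Counter


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation (hypothetical example class)."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._index = 0

    def add_node(self, node):
        """Scale out by adding another machine to the rotation."""
        self._nodes.append(node)

    def route(self):
        """Return the node that should handle the next request."""
        node = self._nodes[self._index % len(self._nodes)]
        self._index += 1
        return node


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
    handled = Counter(balancer.route() for _ in range(9))
    print(handled)  # each node handles three of the nine requests
```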
High performance computing cluster
High-performance computing clusters (HPC clusters) are used to process computing tasks. These tasks are divided among several nodes: either a task is split into packages that are executed in parallel on several nodes, or whole computing tasks (called jobs) are distributed to the individual nodes. The distribution of jobs is usually handled by a job management system. HPC clusters are often found in scientific settings. As a rule, the individual elements of a cluster are connected to one another via a fast network. Render farms also fall into this category.
The first commercially available cluster product was ARCNET, which was developed by Datapoint in 1977. DEC had its first real success in 1983 with the introduction of the VAXCluster product for its VAX computer system. The product supports not only parallel computing on the cluster nodes but also shared use of the file systems and devices of all participating nodes. Many free and commercial products still lack these properties today. VAXCluster is still available today as “VMSCluster” from HP for the OpenVMS operating system and the Alpha and Itanium processors.
The failover function is usually provided by the operating system (service failover, IP takeover). The takeover of services can be achieved, for example, by automatically migrating IP addresses or by using a multicast address.
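The idea of failover with IP takeover can be sketched as a small simulation. This is an illustrative model only, not how any particular operating system or cluster product implements it: the `FailoverCluster` class, the heartbeat timeout and the address `192.0.2.10` (from the documentation range) are assumptions of this example, and no real network interface is touched.

```python
class FailoverCluster:
    """Simulated active/passive cluster that moves a service IP on node failure."""

    def __init__(self, nodes, service_ip, timeout=3.0):
        if not nodes:
            raise ValueError("at least one node is required")
        # Time of the last heartbeat seen from each node (simulated clock).
        self.last_heartbeat = {node: 0.0 for node in nodes}
        self.service_ip = service_ip
        self.timeout = timeout
        self.ip_owner = nodes[0]  # the first node starts as the active one

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_heartbeat[node] = now

    def is_alive(self, node, now):
        return now - self.last_heartbeat[node] <= self.timeout

    def check(self, now):
        """Return the node holding the service IP, failing over if necessary."""
        if self.is_alive(self.ip_owner, now):
            return self.ip_owner
        for node in self.last_heartbeat:
            if self.is_alive(node, now):
                self.ip_owner = node  # simulated IP takeover
                return node
        raise RuntimeError("no healthy node left to take over the service")


if __name__ == "__main__":
    cluster = FailoverCluster(["node-a", "node-b"], "192.0.2.10")
    cluster.heartbeat("node-a", 1.0)
    cluster.heartbeat("node-b", 1.0)
    print(cluster.check(2.0))   # node-a still owns the address
    cluster.heartbeat("node-b", 5.0)  # node-a has gone silent
    print(cluster.check(6.0))   # node-b has taken over the address
```

Passing the time in explicitly keeps the sketch deterministic; a real monitor would read a clock and exchange heartbeats over the network.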
A general distinction is made between the shared-nothing and shared-all architectures.
A typical representative of the “active-active” cluster with a shared-nothing architecture is DB2 with EEE (pronounced “triple e”). Here, each cluster node holds its own data partition. Performance is gained by partitioning the data and processing it in a distributed manner. Fault tolerance is not guaranteed.
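The shared-nothing principle can be illustrated with a small in-memory model. This is a generic sketch and not a description of how DB2 partitions data: the `ShardedStore` class is hypothetical, and hashing keys with CRC32 is just one possible partitioning rule. The example also shows why fault tolerance is not automatic: when a node is lost, its partition is simply unavailable.

```python
import zlib


def partition_for(key: str, node_count: int) -> int:
    """Map a key to exactly one partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


class ShardedStore:
    """Each node owns one private partition; nothing is shared between nodes."""

    def __init__(self, node_count: int):
        if node_count < 1:
            raise ValueError("at least one node is required")
        self.partitions = [{} for _ in range(node_count)]

    def put(self, key: str, value):
        self._partition(key)[key] = value

    def get(self, key: str):
        return self._partition(key)[key]

    def fail_node(self, index: int):
        """Simulate losing a node: its partition becomes unreachable."""
        self.partitions[index] = None

    def _partition(self, key: str):
        partition = self.partitions[partition_for(key, len(self.partitions))]
        if partition is None:
            raise LookupError(f"the node holding {key!r} is down")
        return partition


if __name__ == "__main__":
    store = ShardedStore(node_count=3)
    for customer in ["alice", "bob", "carol", "dave"]:
        store.put(customer, {"name": customer})
    print([len(p) for p in store.partitions])  # rows held by each node
```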
The “shared-all” cluster is different. Through concurrent access to shared storage, this architecture ensures that all cluster nodes can access the entire database. In addition to scaling and increased performance, it therefore also provides additional reliability: if one node fails, the other nodes take over its tasks. A typical representative of the shared-all architecture is Oracle Real Application Clusters (RAC).
HA clusters can also boot directly from a storage area network (SAN) as a “single system image”, without local disks. Such diskless shared-root clusters make it easier to replace cluster nodes, which in this configuration contribute only their computing power and I/O bandwidth.
Services must be specially programmed for use on a cluster. A service is referred to as “cluster aware” if it reacts to special events (such as the failure of a cluster node) and processes them in a suitable manner.
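What “cluster aware” means in practice can be sketched as a service that receives cluster events and reacts to them. The event names, the `ClusterAwareService` class and its `on_event` method are hypothetical and chosen only for illustration; real cluster frameworks define their own notification interfaces.

```python
from enum import Enum, auto


class ClusterEvent(Enum):
    NODE_FAILED = auto()
    NODE_JOINED = auto()


class ClusterAwareService:
    """A service that tracks cluster membership and reacts to changes."""

    def __init__(self, name, members):
        self.name = name
        self.members = set(members)
        self.log = []  # actions taken in response to events

    def on_event(self, event, node):
        """Callback that the cluster software would invoke (assumed interface)."""
        if event is ClusterEvent.NODE_FAILED:
            self.members.discard(node)
            self.log.append(f"redistributing work from {node}")
        elif event is ClusterEvent.NODE_JOINED:
            self.members.add(node)
            self.log.append(f"including {node} in work distribution")


if __name__ == "__main__":
    service = ClusterAwareService("database", ["node-a", "node-b"])
    service.on_event(ClusterEvent.NODE_FAILED, "node-b")
    service.on_event(ClusterEvent.NODE_JOINED, "node-c")
    print(sorted(service.members))
    print(service.log)
```

A service that is not cluster aware would have no such handler and would have to be stopped and restarted by external scripts instead.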
Cluster software can be implemented in the form of scripts or be integrated into the operating system kernel.
In HPC clusters, the task to be done, the “job”, is often broken down into smaller parts using a decomposition program and then distributed to the nodes.
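The decomposition step can be sketched as follows. In this illustrative example a thread pool on a single machine stands in for the cluster nodes, which is an assumption made only to keep the code self-contained; the function names and the sum-of-squares workload are hypothetical. On a real cluster the parts would be sent to separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(data, parts):
    """Split `data` into at most `parts` contiguous chunks of nearly equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there is little data
            chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done on one part of the job."""
    return sum(x * x for x in chunk)


def run_job(data, workers=4):
    """Decompose the job, process the parts in parallel, combine the results."""
    chunks = decompose(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    numbers = list(range(10))
    print(decompose(numbers, 3))
    print(run_job(numbers))
```

The pattern is the same regardless of the workload: split, process the parts independently, then combine the partial results.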
Communication between job parts that run on different nodes usually takes place via the Message Passing Interface (MPI), since fast communication between individual processes is desired. For this purpose, the nodes are coupled with a fast network such as InfiniBand.
A common way of distributing jobs on an HPC cluster is a job scheduling program, such as Load Sharing Facility (LSF) or Network Queuing System (NQS), which can distribute the jobs according to different categories.
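The following minimal sketch shows one way a scheduler could place jobs on nodes. It does not reproduce the behavior of LSF or NQS: the priority field, the estimated runtimes and the “least busy node first” rule are assumptions of this example, and all names are hypothetical.

```python
import heapq


def schedule(jobs, nodes):
    """Assign jobs to nodes.

    jobs:  list of (name, priority, estimated_runtime); a lower priority
           number means the job is placed earlier.
    nodes: list of node names.
    Returns a dict mapping each node to the list of job names it will run.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    # Min-heap of (accumulated runtime, node): the least busy node is on top.
    load = [(0.0, node) for node in nodes]
    heapq.heapify(load)
    assignment = {node: [] for node in nodes}
    for name, _priority, runtime in sorted(jobs, key=lambda job: job[1]):
        busy, node = heapq.heappop(load)
        assignment[node].append(name)
        heapq.heappush(load, (busy + runtime, node))
    return assignment


if __name__ == "__main__":
    jobs = [("render", 2, 5.0), ("backup", 3, 1.0), ("simulation", 1, 10.0)]
    print(schedule(jobs, ["n1", "n2"]))
```

In this run the long, high-priority simulation occupies one node while the two shorter jobs share the other.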
- The Beowulf Project - Distributed Computing
- heartbeat - HA cluster software
- HACMP - HA cluster software from IBM for AIX
- Kerrighed - Distributed Computing
- Kimberlite - HA Failover Cluster
- MC/ServiceGuard - HA cluster software from HP for HP-UX and Linux
- MPI - The Message Passing Interface (MPI) standard
- MOSIX - Cluster and Multi-Cluster Management
- openMosix - Distributed Computing (project discontinued)
- Oracle RAC - Cluster software from Oracle
- Proxmox VE - KVM virtualization software
- Solaris Cluster - Cluster software from Sun Microsystems
- Veritas Cluster Server for AIX, HP-UX, Linux (Red Hat & Suse), Solaris and Microsoft Windows (Windows 2000 & Windows 2003)
- VMSCluster - VMSCluster from HP for OpenVMS
- wackamole - HA cluster software (moves IP addresses, project discontinued)
- Windows Server 2008 Failover Cluster - HA cluster software from Microsoft
- Windows HPC Server 2008 R2 from Microsoft
- x10sure - HA cluster software from Fujitsu
- Active / active cluster
- Cluster file system
- Grid computing
- High availability
- Quorum (computer science)
- Parallel Sysplex
- Data center
- Shared storage