High availability (HA) refers to the ability of a system to remain in operation with high probability (often 99.99% or better) despite the failure of one of its components. In contrast to fault tolerance, operation may be interrupted in the event of a fault.
Availability and high availability
A system is said to be available when it is able to perform the tasks for which it is intended. Availability is the probability that a system is functional (available) within a specified time period. It is calculated from the ratio of unplanned (error-related) downtime to the total production time of the system:

$$\text{Availability} = \left(1 - \frac{\text{downtime}}{\text{total production time}}\right) \cdot 100\,\%$$
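As a minimal sketch, the relationship between downtime and availability can be computed as follows. The function name `availability_percent` and the use of hours as the unit are assumptions made for illustration; the sketch reads availability as one minus the share of unplanned downtime in the total observation period.

```python
def availability_percent(downtime_hours: float, total_hours: float) -> float:
    """Return the availability in percent for one observation period.

    Assumes availability = (1 - downtime / total time) * 100.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must lie between 0 and total_hours")
    return (1 - downtime_hours / total_hours) * 100


# Example: 8.766 hours of unplanned downtime in an average year of 365.25 days.
hours_per_year = 365.25 * 24
print(f"{availability_percent(8.766, hours_per_year):.3f}%")
```

Planned maintenance windows are deliberately left out of `downtime_hours` here, since only unplanned downtime enters the ratio.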
The exact definition of high availability can vary. The Institute of Electrical and Electronics Engineers (IEEE) gives the following definition:
"High Availability (HA for short) refers to the availability of resources in a computer system, in the wake of component failures in the system."
Another definition of high availability is:
"A system is considered to be highly available if an application remains available in the event of an error and can continue to be used without direct human intervention. As a consequence, the user perceives no interruption or only a short one. High availability (HA for short) describes the ability of a system to guarantee unrestricted operation if one of its components fails."
High availability and availability classes
The availability class from which a system counts as highly available depends on the definition of availability that is used.
An availability of 99% is generally not considered high availability; it is nowadays regarded as basic or normal, at least for high-quality IT equipment. High availability is therefore only spoken of at 99.9% or higher. Whether three nines are already sufficient, or whether only four or five nines make a system a high-availability system, depends on the source and manufacturer and has to be assessed for the respective application scenario. In general, a system can be classified as highly available if its annual downtime is in the range of a few minutes (about 99.999%, or AEC-2) or less. In English this is also called dial-tone availability, since this level of availability is achieved for landline telephony.
If the above formula is used to calculate the availability over a period of one year, an availability of 99.99% corresponds, for example, to a downtime of 52.6 minutes. The number of nines in the percentage is usually used to identify the availability class: the above example with 99.99% means availability class 4.
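The rule of counting nines can be sketched in a few lines. The function name `availability_class` is an assumption for illustration, and the percentage is passed as a string so that `decimal.Decimal` can represent it exactly; with binary floating point, 99.99% would lose a tiny amount of precision and could be assigned to class 3 by mistake.

```python
import math
from decimal import Decimal


def availability_class(percent: str) -> int:
    """Return the number of leading nines of an availability given in percent."""
    unavailability = (Decimal(100) - Decimal(percent)) / Decimal(100)
    if unavailability <= 0:
        raise ValueError("availability must be below 100%")
    # The class is the integer part of -log10(unavailability).
    return math.floor(-unavailability.log10())


for value in ("99", "99.9", "99.99", "99.989", "99.999"):
    print(value, "-> class", availability_class(value))
```

A value such as 99.989% falls just short of four nines and is therefore still reported as class 3 by this strict count.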
The following overview lists the maximum downtime for the relevant classes 2 to 6, where a year is taken as 365.25 days on average and a month as 1/12 of a year:
- Availability class 2: 99% ≡ 438 minutes/month or 7:18:18 hours/month = 87.7 hours/year, i.e. 3 days and 15:39:36 hours
- Availability class 3: 99.9% ≡ 43:50 minutes/month or 8:45:58 hours/year
- Availability class 4: 99.99% ≡ 4:23 minutes/month or 52:36 minutes/year
- Availability class 5: 99.999% ≡ 26.3 seconds/month or 5:16 minutes/year
- Availability class 6: 99.9999% ≡ 2.63 seconds/month or 31.6 seconds/year
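The figures in this overview can be reproduced with a short script. This is a sketch under the stated convention (a year of 365.25 days, a month as 1/12 of a year); the names `max_downtime_seconds` and `format_duration` are assumptions for illustration, and class n is taken to mean an availability of 1 − 10⁻ⁿ.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # average year, including leap days
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def max_downtime_seconds(availability_class: int, period_seconds: float) -> float:
    """Maximum downtime in a period for class n (availability 1 - 10**-n)."""
    if availability_class < 1:
        raise ValueError("availability_class must be at least 1")
    return period_seconds * 10.0 ** -availability_class


def format_duration(seconds: float) -> str:
    """Format a duration as h:mm:ss, rounded to the nearest second."""
    total = round(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


for n in range(2, 7):
    per_month = max_downtime_seconds(n, SECONDS_PER_MONTH)
    per_year = max_downtime_seconds(n, SECONDS_PER_YEAR)
    print(f"class {n}: {format_duration(per_month)} per month, "
          f"{format_duration(per_year)} per year")
```

Because `format_duration` rounds to whole seconds, the sub-minute values of classes 5 and 6 are better read directly from `max_downtime_seconds`.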
A total downtime of one day per year corresponds to a calculated availability of 99.73% (almost class 3), one hour to 99.989% (practically class 4), one minute to 99.99981% (almost class 6) and one second to 99.9999968% (class 7). These values correspond roughly to the 3σ, 4σ, 5σ and 6σ levels of the standard normal distribution.
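The comparison with the normal distribution can be checked numerically. The sketch below assumes that the σ levels refer to the two-sided coverage of a standard normal distribution, which equals erf(k/√2) for k standard deviations; the helper names are chosen for illustration only.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def availability_for_downtime(downtime_seconds: float) -> float:
    """Availability as a fraction for a given total downtime per year."""
    return 1 - downtime_seconds / SECONDS_PER_YEAR


def normal_coverage(k: float) -> float:
    """Probability that a standard normal value lies within +/- k sigma."""
    return math.erf(k / math.sqrt(2))


downtimes = {"one day": 86400, "one hour": 3600, "one minute": 60, "one second": 1}
for (label, seconds), k in zip(downtimes.items(), (3, 4, 5, 6)):
    print(f"{label}: {availability_for_downtime(seconds) * 100:.7f}%   "
          f"{k} sigma: {normal_coverage(k) * 100:.7f}%")
```

The match is closest for one day and 3σ; for the higher levels the two series agree only in order of magnitude.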
Availability Environment Classification
The Harvard Research Group (HRG) divides high availability into six classes in its Availability Environment Classification (AEC).
|Class||Designation||Description|
|AEC-0||Conventional||Function can be interrupted, data integrity is not essential|
|AEC-1||Highly Reliable||Function can be interrupted, but data integrity must be guaranteed|
|AEC-2||High Availability||Function may only be minimally interrupted within specified times or during main operating hours|
|AEC-3||Fault Resilient||Function must be maintained without interruption within specified times or during main operating hours|
|AEC-4||Fault Tolerant||Function must be maintained without interruption, 24/7 operation (24 hours, 7 days a week) must be guaranteed|
|AEC-5||Disaster Tolerant||Function must be available under all circumstances|
Agreed period of availability
Many high-availability systems have to be online 24 hours a day, 7 days a week, in other words around the clock all year round. Some systems, however, only need to be highly available during certain periods: the trading systems of Deutsche Börse, for example, do not need to be highly available at night or on non-trading days. For such systems, high availability relates only to the times of day and/or the working days on which it is required.
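When availability is only agreed for certain periods, an outage counts against it only to the extent that it overlaps with those periods. The following sketch illustrates this with a purely hypothetical service window (Monday to Friday, 08:00 to 20:00); the window, the function names and the one-minute resolution are all assumptions and do not describe any real trading system.

```python
from datetime import datetime, time, timedelta

# Hypothetical agreed service window: Monday to Friday, 08:00 to 20:00.
WINDOW_START = time(8, 0)
WINDOW_END = time(20, 0)


def in_service_window(moment: datetime) -> bool:
    """True if the moment falls inside the agreed period of availability."""
    is_working_day = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_working_day and WINDOW_START <= moment.time() < WINDOW_END


def counted_downtime_minutes(outage_start: datetime, outage_end: datetime) -> int:
    """Count only those outage minutes that fall inside the service window."""
    minutes = 0
    current = outage_start
    while current < outage_end:
        if in_service_window(current):
            minutes += 1
        current += timedelta(minutes=1)
    return minutes


# An outage from Friday 19:30 until Saturday 02:00 lasts 6.5 hours,
# but only the part before 20:00 on Friday counts against availability.
outage = (datetime(2024, 1, 5, 19, 30), datetime(2024, 1, 6, 2, 0))
print(counted_downtime_minutes(*outage))
```

Stepping minute by minute keeps the sketch short; for long observation periods an interval-overlap calculation would be more efficient.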
Requirements for high availability
In general, HA systems strive to eliminate single points of failure (SPOF). A SPOF is a single component whose failure leads to the failure of the entire system.
A manufacturer of a high-availability system must equip it with the following features:
Typical examples of components used to achieve increased fault tolerance are uninterruptible power supplies (UPS), multiple power supplies, ECC memory and RAID systems. Techniques for server mirroring or redundant clusters are also used.
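The effect of redundancy on a single point of failure can be illustrated with the standard formulas for components in series and in parallel. This is a simplified sketch: it assumes that components fail independently and that failover between redundant components is perfect, and the availability figures are hypothetical.

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """All components are required: the system is up only if every one is up."""
    return prod(components)


def parallel_availability(components: list[float]) -> float:
    """Redundant components: the system is down only if all of them are down."""
    return 1 - prod(1 - a for a in components)


# Hypothetical figures: a server (99.9%) fed by a power supply (99%).
server = 0.999
power_supply = 0.99

single = series_availability([server, power_supply])
redundant = series_availability(
    [server, parallel_availability([power_supply, power_supply])]
)
print(f"single power supply:     {single * 100:.3f}%")
print(f"redundant power supplies: {redundant * 100:.3f}%")
```

In the first configuration the power supply is a single point of failure and dominates the result; duplicating it moves the overall figure close to the availability of the server alone.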
The higher the required availability, the more effort the operator has to invest in:
- quickly accessible specialist staff
- availability of spare parts
- preventive maintenance
- qualified fault reporting and a fast communication system
Examples of highly specialized systems with the highest availability include:
- the Continuum series from Stratus
- the Integrity NonStop series from HP, which came to HP via Compaq following the acquisitions of Tandem (1997) and Digital Equipment Corporation (1998)
- mainframes in general, e.g. those of the System z series from IBM
- telephone exchanges
See also
- Defect management
- Load balancing (computer science)
- Performance management
- Shared Risk Link Group (SRLG)
- Single point of failure
- Standby database
Literature
- Martin Wieczorek, Uwe Naujoks, Bob Bartlett (eds.): Business Continuity. Springer, 2003, ISBN 3-540-44285-5.
- Evan Marcus, Hal Stern: Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, 2000, ISBN 0-471-35601-8.
- Floyd Piedad, Michael Hawkins: High Availability: Design, Techniques and Processes. Prentice Hall PTR, 2000, ISBN 0-13-096288-0.
References
- High Availability (HA). IEEE Task Force on Cluster Computing, archived from the original on July 14, 2010; accessed on October 26, 2010.
- Andrea Held: Oracle 10g high availability. Addison-Wesley, 2004, ISBN 3-8273-2163-8.
- Matthew Portnoy: Virtualization for Beginners. Wiley-VCH Verlag, Weinheim, 1st edition 2012, ISBN 978-3-527-76023-7.
- HRG 2002, see also Andrea Held: High availability: key figures and metrics. In: TEC Channel. June 6, 2005, archived from the original on April 20, 2008; accessed on October 26, 2010.