The reliability of a technical product or system is a property (behavioral characteristic) that indicates how dependably a function assigned to the product or system is fulfilled over a time interval. It is subject to a stochastic process and can be described qualitatively or quantitatively (by the probability of survival); it is not directly measurable.
This distinguishes reliability from the deterministic properties (features) of a product, which can be measured directly (such as weight, dimensions, strength, color, and electrical or thermal conductivity).
The characteristic of reliability is inherent in all technical products: no technical product is free from the possibility of failure.
The reliability of a product can be determined either empirically, by measuring failure frequencies, or analytically, by deriving it from the reliability values of the product's parts. For simple technical devices, the empirical approach is usually chosen. For complex large-scale industrial plants, the proof of reliability with regard to dangerous conditions can usually only be provided analytically.
History of reliability engineering
The development of military aircraft and missile systems in the 1940s and 1950s was associated with high failure rates (the V1 in Germany, the Minuteman system in the USA; see VDI 4002, MIL-HDBK-338). The more extensive and complex a device was, the more error-prone it proved. There was therefore a need for methods with which the reliability of such devices could be increased. This initiated the development of reliability methods and gave rise to the discipline of reliability engineering.
One of the first German-language works on reliability, Technical Reliability (Messerschmitt-Bölkow-Blohm, Springer Verlag, 1977), states:
"Reliability is a property that can be estimated empirically, or with the aid of probability calculus as a statistically measurable variable, on the basis of observed failure frequencies."
The need for suitable methods of determining the reliability of technical products was particularly evident in the aerospace industry and, with some time lag, in nuclear technology. The modeling of large-scale systems with simple block diagrams (black boxes) was no longer sufficient and called for improved methods. In the American aerospace industry, the methods of fault tree analysis, failure mode and effects analysis (FMEA) and fault hazard analysis (e.g. Boeing System Safety Documents) were in use as early as the late 1960s.
In Germany, reliability engineering received its essential methodological basis with the establishment of the VDI Technical Committee Reliability and Quality Control in 1964 (VDI 4001) and the DIN specialist working group KT Reliability of Nuclear Plants. It is reflected in the VDI manual on technical reliability (VDI 4001) as well as in the DIN standards on fault tree, event tree, and failure mode and effects analysis (DIN 25419, DIN 25424 and EN 60812). These standards were developed over several years of expert work and are still valid today. How the different types of analysis were used, however, varied greatly with the experience of the user. What was still missing was a holistic approach combining the methods.
From this level of experience the method of risk analysis was developed, which likewise uses the methods of reliability engineering. With the first risk analysis for a large-scale plant, the so-called "Rasmussen Study" (WASH-1400: Reactor Safety Study, an Assessment of Accident Risk in US Commercial NPP, NUREG-75/014, 1975), a holistic approach to event tree and fault tree analysis was also worked out for the first time. The incident sequences to be analyzed were so complex that they could not be represented directly in a fault tree model. Their logical structures could be represented much more clearly in event trees. The systems used to control an incident enter the model via the branch points of the event tree; they are then analyzed and displayed in fault trees. The risk model of the overall system accordingly consists of a large number of interlinked event and fault trees, which in their entirety can only be analyzed and quality-assured by means of a computer program.
The application of risk analysis (probabilistic safety analysis) significantly increased the need to extend the reliability methods, for example with human factor analysis (VDI 4006), the analysis of dependent failures of redundant components (common cause failure, CCF; in German GVA), and the quantification of the uncertainties of the analysis results.
This level of development is also reflected in the new standard for failure mode and effects analysis (FMEA) (2006) compared with DIN 25448 (1990). The amendment note of the new standard lists the following changes:
"a) Consideration of failures with a common cause; b) inclusion of human influences; c) handling of software errors; d) introduction of the concepts of failure mode effects and criticality; e) inclusion of methods widely used in the automotive industry; f) supplemented normative references and connections with other failure mode analysis methods; g) added examples; h) treatment of advantages and disadvantages of different FMEA methods."
The methods and terms of reliability engineering are now comprehensively described in national and international reliability and risk standards and apply in principle to all technical products and systems (see also the section on areas of application of reliability engineering).
Due to their complexity and their low probability of failure, the reliability of large-scale systems - such as a chemical plant or a nuclear power plant - cannot be determined from operational monitoring alone. Instead, analytical reliability models such as the fault tree and event tree model are used, in which the failure structure of the overall system is mapped and calculated. The reliability or failure probability of the overall system is then calculated from the empirically obtained failure frequencies (failure rates) of the individual components of the system. The mathematical derivation of reliability from the failure rate is shown under failure rate.
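A minimal sketch of such a fault tree calculation, assuming independent component failures (the gate structure and all probability values below are illustrative, not taken from a real plant model):

```python
# Fault tree evaluation sketch: the top-event probability is computed
# from empirically obtained component failure probabilities, assuming
# independent failures. All numbers are illustrative.

def and_gate(probs):
    """All inputs must fail: product of the failure probabilities."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(probs):
    """Any single input failure suffices: 1 - product of survival probabilities."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

# Hypothetical top event: pump fails OR both redundant valves fail
p_pump = 1e-3
p_valve = 1e-2
p_top = or_gate([p_pump, and_gate([p_valve, p_valve])])
```

Real risk models consist of many interlinked trees of this kind, which is why, as noted above, they can only be evaluated with dedicated computer programs.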
Carrying out complex reliability analyses requires an experienced team, systematic planning of all necessary work steps, a suitable reliability database and reliability software.
This organizational task is called Dependability Management and is comprehensively described in VDI 4003 and IEC 60300. VDI 4003 also gives an overview of the large number of analytical methods used today for reliability analysis and determination.
Software reliability
Software reliability is defined as the probability of error-free software operation over a specified period of time and under specified environmental conditions (according to ANSI91, MIL-HDBK-338B, Section 9.1).
Software is immaterial and is not subject to any wear-out mechanism, as hardware is. The error rate of software is therefore independent of its age and of the frequency of its use.
There are three different types of software errors:
- Incorrect requirement: Error in the software requirement that specifies the environmental conditions in which the software is used.
- Design error: Incorrect design in relation to the specified requirement.
- Program error: Incorrect programming in terms of compliance with the software design.
Software must always be implemented on hardware before it can be tested. When an error occurs, it is usually difficult to determine whether it is due to the hardware, the software or their interaction (Section 9-3).
Software errors that have not already been identified and eliminated during development testing are hidden error mechanisms (latent errors, see Section 2.2) that only appear under certain system conditions. The rate at which latent errors are detected increases with the variety of system applications, and the elimination of latent errors reduces the error rate of the software (analogous to the early failures of hardware systems).
(For test procedures for checking software, see software reliability.)
The determination of reliable probabilistic data is of particular importance for reliability analysis. Such data are gained from operating experience with technical products by systematically evaluating the frequency and causes of failures of similar products.
Experience from this data collection shows that the failure behavior of technical products generally passes through three different phases over their lifespan. At the beginning of a product's use, so-called early failures occur more frequently; they are due to initial design weaknesses and are eliminated with increasing operational experience. This is followed by the so-called usability phase, which is characterized by a low and largely constant failure rate; the failure probability of such a system is exponentially distributed. Toward the end of the service life, wear-out failures occur more frequently, which in turn lead to an increasing failure frequency - up to the point where the product becomes unusable. The resulting course of the failure rate is characterized by the so-called "bathtub curve" (device service life) (VDI 4010, sheet 3). The Weibull distribution is used to model this aging process.
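The three phases of the bathtub curve can be illustrated with the Weibull hazard (failure) rate h(t) = (β/η)·(t/η)^(β−1): a shape parameter β < 1 gives the falling rate of the early-failure phase, β = 1 the constant rate of the usability phase, and β > 1 the rising rate of the wear-out phase (the parameter values below are illustrative):

```python
def weibull_hazard(t, beta, eta):
    """Weibull failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# beta < 1: early failures - the rate decreases with time
early = [weibull_hazard(t, beta=0.5, eta=1000.0) for t in (10.0, 100.0)]
# beta = 1: usability phase - constant rate (exponential distribution)
const = [weibull_hazard(t, beta=1.0, eta=1000.0) for t in (10.0, 100.0)]
# beta > 1: wear-out - the rate increases with time
wear = [weibull_hazard(t, beta=3.0, eta=1000.0) for t in (10.0, 100.0)]
```

For β = 1 the rate reduces to the constant 1/η, which connects the bathtub model to the exponential distribution mentioned above.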
The MTBF (mean time between failures) is a further measure of the reliability of repairable units (assemblies, devices or systems). If the failure rate is constant (the reliability variable is exponentially distributed; only random failures occur), the MTBF is the reciprocal of the failure rate. The same applies to the reliability measure MTTF (mean time to failure), which is used for non-repairable units.
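For a constant failure rate λ, the survival probability is R(t) = e^(−λt) and MTBF (or MTTF) = 1/λ; a brief sketch with an illustrative failure rate:

```python
import math

failure_rate = 1e-4   # lambda in failures per hour (illustrative value)

# Constant failure rate: MTBF is the reciprocal of lambda
mtbf = 1.0 / failure_rate   # 10000 h

def survival(t):
    """Probability R(t) = exp(-lambda * t) of surviving t hours without failure."""
    return math.exp(-failure_rate * t)

# At t = MTBF the survival probability is e^-1, about 36.8 %
r_at_mtbf = survival(mtbf)
```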
The systematic collection of reliability data from operational experience is usually time-consuming, cost-intensive and must extend over long periods. Providing qualified reliability data requires not only an experienced team of reliability experts but also the - not always self-evident - cooperation of experienced plant engineers, who are needed for a qualified assessment of the observed failure causes. Generally accessible reliability databases therefore became available - compared with the reliability methods - only at a much later point in time.
Unavailability is used in the reliability model (e.g. the fault tree) for so-called stand-by components that are to come into operation on demand (e.g. the emergency diesel generator in the event of a power failure, or the fire alarm and fire-extinguishing pump in the event of a fire) (see Chapters 2.1.4 and 6.3). In the stand-by phase, a passive (non-self-reporting) failure is generally assumed for these components and assessed with a corresponding failure rate. The unavailability as a probabilistic variable is then determined from the product of the failure rate (assumption: λ is constant and ≪ 1) and the time until the next functional test of the component. The test interval of the component therefore enters its unavailability linearly. In addition, the repair time after a failure of the component contributes a further share of the stand-by unavailability (the product of failure rate and repair time).
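A sketch of this stand-by unavailability estimate. One common convention (an assumption here, since the text only states that the test interval enters linearly) is to average over the test interval T, giving λ·T/2, and to add the repair contribution λ·T_R; all numbers are illustrative:

```python
# Stand-by unavailability sketch. Assumptions: constant failure rate
# lambda_ with lambda_ * T << 1, passive failures revealed only by
# periodic functional tests. All values are illustrative.

lambda_ = 1e-5   # failure rate in 1/h
T = 720.0        # test interval in h (roughly a monthly test)
T_R = 24.0       # mean repair time in h

# The component may fail at any point of the test interval, so on
# average it is unavailable for half the interval: u = lambda * T/2.
u_test = lambda_ * T / 2.0
# Repair contribution: product of failure rate and repair time.
u_repair = lambda_ * T_R
u_total = u_test + u_repair
```

The linear dependence on T shows why shortening the test interval directly reduces the component's unavailability.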
In the case of redundant technical equipment and components of the same type, there is in principle the possibility that both units fail due to a common failure mechanism, which is referred to as common cause failure (CCF; German abbreviation GVA). As part of the probabilistic safety analyses (PSA) for nuclear power plants, extensive national and international method development for the analysis and data acquisition of CCF has been carried out (see Chapter 3.3, Appendix A).
Determination of service life according to Arrhenius
One method of determining the service life / failure rate is accelerated aging according to the Arrhenius or Eyring method, which component manufacturers often use for small component populations. The method (see highly accelerated life test and end-of-life tests) is defined in various standards:
- ISO Standard 18921: 2008, “Imaging materials - Compact discs (CD-ROM) - Method for estimating the life expectancy based on the effects of temperature and relative humidity”.
- Standard ECMA-379 (identical to ISO/IEC 10995:2008), “Test Method for the Estimation of the Archival Lifetime of Optical Media”.
- USA - National Institute of Standards and Technology (NIST): "Optical Media Longevity Study".
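The Arrhenius model underlying these accelerated-aging tests relates the degradation speed to temperature; the acceleration factor between stress and use temperature is AF = exp[(E_a/k)·(1/T_use − 1/T_stress)]. The activation energy and temperatures below are illustrative assumptions, not values from the standards:

```python
import math

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def arrhenius_af(e_a_ev, t_use_c, t_stress_c):
    """Acceleration factor of an aging test at t_stress_c versus use at t_use_c.

    e_a_ev is the activation energy in eV; temperatures are in Celsius.
    """
    t_use = t_use_c + 273.15       # convert Celsius to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((e_a_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Hypothetical example: 0.7 eV activation energy, 85 C oven test vs 25 C use
af = arrhenius_af(0.7, 25.0, 85.0)
# An observed test lifetime is scaled to an estimated field lifetime:
field_life_h = 1000.0 * af   # 1000 h of accelerated testing
```

Raising the stress temperature increases the acceleration factor, which is why a few weeks in an oven can stand in for years of field use - with the caveats on unmodeled failure mechanisms discussed below.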
However, field tests have shown that the results of such laboratory tests often underestimate the real failure rate, since not all possible failure mechanisms can be anticipated and simulated in the laboratory. In a large field test of computer hard drives, annual failure rates between about 2% and 9% were determined, whereas the manufacturers' specifications were below 2%.
In information technology, the determination of the lifespan of digital data carriers (such as hard drives, USB sticks, CDs, DVDs, magnetic tapes and floppy disks) is becoming increasingly important for the long-term archiving of digital information. Because of the very different technologies of these data carriers, they have different failure mechanisms and accordingly different lifetimes (see information technology).
Definitions of terms
In the standard works, the term reliability (reliability / dependability) has two different meanings. On the one hand it is viewed as a superordinate characteristic that includes other characteristics; on the other hand, as a stand-alone characteristic (see the definitions below). The partly differing definitions in the German- and English-speaking areas also make clear that the process of defining the terms of reliability engineering is not yet complete.
- Summarizing term for functional reliability, availability, safety and maintainability. (VDI 4003 - Reliability Management, 2005-07)
- "Condition of a unit with regard to its suitability to meet the reliability requirement during or after specified periods of time under specified application conditions." (DIN 40041: 1990-12)
- "Collective term used to describe the availability performance and its influencing factors: reliability performance, maintainability performance and maintenance support performance." (IEC 60050, 191-02-06)
- RAMS: Abbreviation for Reliability, Availability, Maintainability, Safety
The term RAMS has established itself in various branches of industry, for example in EN 50126: Railway applications - Specification and demonstration of reliability, availability, maintainability and safety (RAMS); German version: 1999
- Functional reliability
- Ability of a unit under consideration to fulfill a required function under given conditions for a given time interval. Functional reliability can either be described qualitatively or determined quantitatively as a survival probability. (VDI 4003)
- "The ability of an item to perform a required function under given conditions for a given time interval." (IEC 60050, 191-02-06)
- "The probability that an item can perform a required function under given conditions for a given time interval." (IEC 50, 1992)
- "The capability of the software product to maintain a specified level of performance when used under specified conditions." (ISO/IEC 9126-1, 2001)
- Ability of a unit to be able to perform a required function under given conditions at a given point in time or during a given time interval, provided that the necessary external aids are provided. (IEV 191-02-05)
- Unit under consideration
- The unit under consideration (also simply: unit) is the subject of reliability investigations; it can be part of a product or the entire product. It must be defined. (VDI 4003)
- The term product is understood to mean clearly described, deliverable devices, systems, procedures, processes, facilities and services composed of hardware and/or software components and regarded as a separate unit (unit under consideration). (VDI 4003)
The terms product, unit under consideration and system are understood to be synonymous in the sense of the definitions given here.
Objectives of reliability management
- Evidence of a low probability of failure of the product
- Optimization of the reliability, availability, maintainability and safety of the product over its entire life cycle
- System improvement by comparing alternative system designs using reliability assessment
- Detection of critical components ( weak point analysis )
- Optimization of the maintenance processes
- Obtaining planning values for the use of the product under economic and risk aspects
- Definition of the reliability goals - comparison of the target values with data from operational monitoring
- Guarantee, warranty and product liability
- Building a knowledge base about the reliability characteristics of the product
- Obtaining key figures for the quantitative evaluation of the quality, protection and electrical safety of electrotechnical systems and devices from the point of view of occupational safety
Measures to increase reliability
- Use of well-proven and qualified components
- Use of redundant and diverse components
- Self-detection measures
- Application of the " fail-safe " principle
- Verifiability of the components and system complexes
- Qualification of the maintenance of the components
- Ergonomic design of the usability of the components
- Evaluation of the experience feedback to improve the reliability database, which also provides information about the effectiveness of the reliability management.
Areas of application and regulations for reliability engineering
The application of reliability engineering in the various industrial sectors is reflected to a large extent in the industry-specific regulations listed below - without claiming to be exhaustive.
- FAA : System Safety Handbook, December 2000
- NASA : Fault Tree Handbook with Aerospace Applications, office of safety and mission assurance, W. Vesely et al., Version 1.1, August 2002
- MIL-HDBK-338B: Electronic Reliability Design Handbook (10-1998)
- EUROCONTROL : Review of techniques to support the EATMP safety assessment methodology, Volume 1, 01/2004
- NUREG-0492: Fault Tree Handbook , WE Vesely, FF Goldberg, NH Roberts, DF Haasl, 1981
- NUREG / CR-2300: PRA Procedures Guide: A Guide to the Performance of Probabilistic Risk Assessments for Nuclear Power
- Development and Application of Level 1 Probabilistic Safety Assessment for Nuclear Power Plants , Specific Safety Guide Series No. SSG-3, April 27, 2010
- Development and Application of Level 2 Probabilistic Safety Assessment for Nuclear Power Plants , Specific Safety Guide Series No. SSG-4, May 25, 2010
Automotive industry
In the automotive industry, FMEA (failure mode and effects analysis) is used internationally, in particular in the design and development phase of new products or processes, and is also required of suppliers of series parts to the automotive manufacturers (see FMEA).
- QS-9000: FMEA - Failure Mode and Effects Analysis
- Central Association of the Electrical and Electronics Industry ( ZVEI ): Handbook for Robustness Validation of Semiconductor Devices in Automotive Applications , 04/2007
- SAE: The New J1879 Robustness Validation Standard - A New Approach for Optimum Performance Levels
- Robustness Validation
Chemical, oil & gas industry
- Health and Safety Executive : Application of QRA in operational safety issues , 2002
- NORSOK STANDARD Z-013: Risk and emergency preparedness analysis , September 1, 2001
- American Petroleum Institute : API-Publication 581, Base Resource Document - Risk-Based Inspection
- OREDA : Offshore Reliability Data Handbook. 2002.
Railway
- EN 50126-2: Railway applications - Specification and demonstration of reliability, availability, maintainability, safety (RAMS); German version: 1999
- The Yellow Book: Engineering Safety Management Published by Rail Safety and Standards Board on behalf of the UK rail industry.
Electrical energy and device technology (electrical safety)
- Siegfried Altmann (Ed.): Electrical safety and reliability. Scientific reports TH Leipzig 1985, issue 13; 1988, No. 9; 1989, issue 16 (see).
- Siegfried Altmann (Ed.): Electrical safety and reliability. ELEKTRIE, Berlin, 1980, issue 4; 1982, No. 6 and 1985, No. 9 (see).
- Siegfried Altmann : The tolerance limits - reliability of electrical energy systems as a decision-making aid for the protection quality assessment. ELEKTRIE, Berlin 31, 1977, issue 3, pp. 126-138.
- Siegfried Altmann: Application of the reliability theory in the quantitative evaluation of maintenance-compatible constructions in high-voltage plant construction under the aspect of occupational safety. Der Elektro-Praktiker, Berlin 31, 1977, issue 4, pp. 111-120.
- Siegfried Altmann: Electrical safety and reliability. Scientific reports of the TH Leipzig, 1985, issue 13, 88 pages, ISSN 0138-3809.
- P. Bitter: Technical reliability: problems, principles, research methods. Published by Messerschmitt-Bölkow-Blohm, Springer, 1971, digitized February 27, 2008, ISBN 978-3-540-05421-4 .
- David J. Smith: Reliability, Maintainability and Risk: Practical Methods for Engineers. 6th edition. Butterworth-Heinemann, 2000.
- Marko Čepin: Assessment of Power System Reliability: Methods and Applications. Springer, 2011. 
- DIN 25424-1: Fault Tree Analysis ; Method and symbols. Beuth Verlag, 1981-09
- DIN EN 62502: Analysis techniques for dependability - Event tree analysis (ETA). (IEC 62502:2010), Beuth Verlag
- DIN EN 60812:2006-11: Analysis techniques for system reliability - Procedure for failure mode and effects analysis (FMEA). (IEC 60812:2006), Beuth Verlag
- VDI 4001: General information on the VDI manual for technical reliability. (1985-10)
- VDI 4002: Basics of system technology; Explanations of the problem of the reliability of technical products and / or systems. (1986-07)
- VDI 4003: Reliability Management. (2005-07)
- VDI 4004: Reliability parameters; overview. (1986-09)
- VDI 4006: Human Reliability; ergonomic requirements and methods. (2002)
- VDI 4010: Overview of reliability data systems. (ZDS) (1997-03)
- IEC 60300-1: Dependability management systems. (2003)
- IEC 60300-2: Guidelines for dependability management. (2004)
- EN 61709: Electrical components - Reliability - Reference conditions for failure rates and stress models for conversion. (IEC 61709: 2011)
- ISO/IEC 9126-1: Software engineering - Product quality - Part 1: Quality model. (2001)
- BfS-KT: Methods for probabilistic safety analysis for nuclear power plants. (1996)
- SN 29500: Failure rates, components, expected values; globally recognized in-house standard of Siemens AG. (2005)
- ZEDB: Central reliability and event database . VGB-TW804 (2004)
- OREDA: Offshore Reliability Data Handbook. (2002)
- T-Book: Reliability Data of Components in Nordic Nuclear Power Plants.
- MIL-HDBK-217F: Reliability Prediction of Electronic Equipment. (1991)
- MIL-HDBK-338: Electronic Reliability Design Handbook. (1998)
- everyspec.com , MIL-HDBK-338B, ELECTRONIC RELIABILITY DESIGN HANDBOOK.
- nrc.gov , WASH-1400: "Reactor Safety Study, an Assessment of Accident Risk in US Commercial NPP".
- beuth.de , DIN EN 60812: 2006-11.
- cse.cuhk.edu.hk , Handbook of Software Reliability Engineering, IEEE Computer Society Press and McGraw-Hill Book Company.
- ece.cmu.edu , J. Pan, Software Reliability, Dependable Embedded Systems, Carnegie Mellon University, Spring 1999.
- R. Dunn, "Software Defect Removal," McGraw-Hill, 1984.
- vgb.org (PDF; 52 kB), central reliability and event database .
- ebook3000.com , OREDA, Offshore Reliability Data Handbook, 2002.
- stralsakerhetsmyndigheten.se (PDF; 772 kB), Reliability Data Handbook for Piping Components in Nordic Nuclear Power Plants - R-Book, Phase 2, 2011-06.
- doris.bfs.de (PDF; 2.9 MB), BfS: Methods for probabilistic safety analysis for nuclear power plants.
- Data for the quantification of event flowcharts and fault trees, March 1997, BfS-KT-18/97.
- VDI / VDE 3542 sheet 3, Safety-related terms for automation systems - application notes and examples, 2000-10.
- static.googleusercontent.com (PDF; 247 kB), E. Pinheiro, W. Weber, L. Barroso, “Failure Trends in a Large Disk Drive Population”, Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), February 2007.
- american-buddha.com , FAA System Safety Handbook, December 2000.
- elibrary.gsfc.nasa.gov (PDF; 1 MB), NASA: Fault Tree Handbook with Aerospace Application.
- nrc.gov Fault Tree Handbook , WE Vesely, FF Goldberg, NH Roberts, DF Haasl, 1981, NUREG-0492.
- nrc.gov , PRA Procedures Guide: A Guide to the Performance of Probabilistic Risk Assessments for Nuclear Power Plants , NUREG / CR-2300.
- www-pub.iaea.org (PDF; 1.8 MB), IAEA: "Development and Application of Level 1 Probabilistic Safety Assessment for Nuclear Power Plants".
- www-pub.iaea.org (PDF; 1.1 MB), IAEA: "Development and Application of Level 2 Probabilistic Safety Assessment for Nuclear Power Plants".
- qz-online.de, FMEA - Failure Mode and Effects Analysis.
- sae.org, SAE: The New J1879 Robustness Validation Standard.
- standard.no (PDF; 716 kB), NORSOK STANDARD Z-013 Risk and emergency preparedness analysis .
- api.org , API Publication 581, Base Resource Document - Risk-Based Inspection.
- Engineering Safety Management (The Yellow Book), Volumes 1 and 2, Fundamentals and Guidance, Issue 4 , Rail Safety and Standards Board on behalf of the UK rail industry, 2007, ISBN 978-0-9551435-2-6 .
- S. Altmann: Electrical safety - electrical railways and systems
- T-book: Reliability data of components in Nordic nuclear power plants , 7th ed. TUD Office, 2010, ISBN 9789163361449 .
- Weibull.com (PDF; 15.6 MB), MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, 1991.
- Weibull.com (PDF; 4.8 MB), MIL-HDBK-338, Electronic Reliability Design Handbook, 1998.