IEEE 754
The IEEE 754 standard (ANSI/IEEE Std 754-1985; IEC 60559:1989 is the international version) defines standard representations for binary floating point numbers in computers and specifies precise procedures for performing mathematical operations, in particular for rounding. The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic for microprocessor systems (ANSI/IEEE Std 754-1985).
The current edition was published in July 2019 under the name ANSI/IEEE Std 754-2019. The IEEE 854-1987 standard, titled Standard for radix-independent floating-point arithmetic, was fully integrated into IEEE 754-2008.
Overview
The IEEE 754-1985 standard defines two basic data formats for binary floating point numbers, requiring 32 bits (single precision) or 64 bits (double precision) of storage, and two extended formats. IEEE 754-2008 comprises the binary number formats with 16 bits as a minifloat, 32 bits as single, 64 bits as double, and a new 128-bit format. In addition, decimal representations with 32 bits as a minifloat, 64 bits and 128 bits were added.
Finally, there have been proposals and implementations of further number formats that are designed according to the principles of the IEEE 754-1985 standard and are therefore often called IEEE numbers, although strictly speaking they are not covered by the old definition. These include the minifloats integrated in the newer editions, which are intended for teaching. 16-bit minifloats are occasionally used in graphics programming. They also include several number formats with more than 64 bits that IEEE 754-1985 does not define, such as the 80-bit format (extended precision), which IA-32 processors use internally in their classic floating point unit (FPU).
General
The representation of a floating point number
$x = s \cdot m \cdot b^{e}$
consists of:
 Sign $s$ (1 bit)
 Mantissa $m$ ($p$ bits)
 Base $b$ (for normalized floating point numbers according to IEEE 754, $b = 2$)
 Exponent $e$ ($r$ bits)
The sign $s = (-1)^{S}$ is stored in one bit $S$, so that $S = 0$ marks positive numbers and $S = 1$ negative numbers.
The exponent $e$ is stored as a nonnegative binary number $E$ (sometimes also called the characteristic or biased exponent), obtained by adding a fixed bias value $B$: $E = e + B$. The bias value is calculated as $B = 2^{r-1} - 1$. The bias makes it possible to store negative exponents as an unsigned number (the characteristic $E$) without resorting to alternative encodings such as two's complement (compare also excess code).
Finally, the mantissa $1 \le m < 2$ is a value calculated from the $p$ mantissa bits with the value $M$ as $m = 1 + M / 2^{p}$. Put more simply, a "1." is imagined in front of the mantissa bit pattern $M$: $m = 1.M$.
This procedure is possible because the condition $1 \le m < 2$ can always be met for all representable numbers by normalization (see below). Since the mantissa then always starts with "1" on the left, this bit does not need to be stored. This gains one additional bit of precision.
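As a rough illustration of this layout, the following Python sketch splits a single precision value into the sign bit S, the characteristic E and the mantissa bits M, and rebuilds the value from $m = 1 + M/2^{p}$ and $e = E - B$. The helper names `split_single` and `rebuild_normalized` are illustrative assumptions, and the sketch covers normalized numbers only.

```python
import struct

R, P = 8, 23                 # exponent and mantissa widths of the single format
BIAS = 2 ** (R - 1) - 1      # 127

def split_single(x):
    """Return (S, E, M) for x rounded to the single format."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> (R + P)
    characteristic = (bits >> P) & (2 ** R - 1)
    mantissa_bits = bits & (2 ** P - 1)
    return sign, characteristic, mantissa_bits

def rebuild_normalized(sign, characteristic, mantissa_bits):
    """Evaluate (-1)^S * (1 + M / 2^p) * 2^(E - B); valid for 0 < E < 2^r - 1."""
    m = 1 + mantissa_bits / 2 ** P
    return (-1) ** sign * m * 2.0 ** (characteristic - BIAS)

fields = split_single(-6.5)            # -6.5 = -1.625 * 2^2
print(fields)                          # (S, E, M)
print(rebuild_normalized(*fields))
```

Here the characteristic is 129, i.e. the exponent 2 plus the bias 127, and the stored mantissa bits encode only the fractional part 0.625 of the mantissa 1.625.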
Two exponent values with special bit patterns are reserved for special cases: the maximum value ($E = 2^{r} - 1$) and the minimum value ($E = 0$). The special cases NaN and ∞ are encoded with the maximum exponent value. The floating point number 0 and all denormalized values are encoded with zero in the exponent.
Values outside the normal range (numbers that are too large, whether positive or negative) are represented by ∞ or −∞. This extension of the range of values often allows calculations to continue usefully even after an arithmetic overflow. In addition to the number 0 there is also the value −0. While the result of $1/0$ is ∞, the value of $1/(-0)$ is −∞. In comparisons, no distinction is made between 0 and −0.
The values NaN ("not a number") are used to represent undefined values. They occur, for example, as results of operations such as $0/0$ or $\infty - \infty$. NaNs are divided into signaling NaNs (NaNs) for exceptional conditions and quiet NaNs (NaNq).
As a last special case, denormalized numbers (called subnormal numbers in IEEE 754r) fill the range between zero and the normalized floating point number of smallest magnitude. They are stored as fixed point numbers and do not have the same precision as the normalized numbers. By design, the reciprocal of most of these values is ∞.
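The special values can be observed from Python, whose `float` type is an IEEE 754 double on common platforms (an assumption of this sketch). Note that Python itself raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning ∞, so the sketch produces infinity by overflow.

```python
import math

inf = float("inf")
neg_zero = -0.0

# Signed zero: equal in comparisons, but the sign remains observable.
zeros_equal = (neg_zero == 0.0)
neg_zero_sign = math.copysign(1.0, neg_zero)

# Overflow yields infinity instead of wrapping around.
overflowed = 1e308 * 10.0

# An undefined operation yields NaN.
undefined = inf - inf

# The smallest denormalized double is so small that its reciprocal overflows.
smallest_denormalized = 5e-324
reciprocal = 1.0 / smallest_denormalized

print(zeros_equal, neg_zero_sign, overflowed, undefined, reciprocal)
```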
Number formats and other specifications of the IEEE 754 standard
IEEE 754 distinguishes between four representations: single precision (single), single extended precision (single extended), double precision (double) and double extended precision (double extended). For the extended formats only a minimum number of bits is prescribed; the exact number of bits and the bias value are left to the implementer. The basic formats are fully defined.
The number of exponent bits mainly determines the maximum and minimum of the representable numbers. The number of mantissa bits determines the (relative, see below) accuracy of these numbers (and only to a small extent the maximum and minimum).
Type  Size (1 + r + p)  Exponent (r)  Mantissa (p)  Values of the exponent (e)  Bias value (B)

single  32 bit  8 bit  23 bit  −126 ≤ e ≤ 127  127
single extended  ≥ 43 bit  ≥ 11 bit  ≥ 31 bit  e_{min} ≤ −1022, e_{max} ≥ 1023  not specified
double  64 bit  11 bit  52 bit  −1022 ≤ e ≤ 1023  1023
double extended  ≥ 79 bit  ≥ 15 bit  ≥ 63 bit  e_{min} ≤ −16382, e_{max} ≥ 16383  not specified
quadruple  128 bit  15 bit  112 bit  −16382 ≤ e ≤ 16383  16383
The last two examples show a minimal extended format.
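For the fully defined formats, the columns of the table follow from the two widths r and p alone. The sketch below assumes the bias $B = 2^{r-1} - 1$ and that the characteristics 0 and $2^{r} - 1$ are reserved; `format_parameters` is a hypothetical helper name.

```python
def format_parameters(r, p):
    """Derive size, bias and exponent range from exponent bits r and mantissa bits p."""
    bias = 2 ** (r - 1) - 1
    return {
        "size": 1 + r + p,
        "bias": bias,
        "e_min": 1 - bias,     # E = 1 is the smallest normalized characteristic
        "e_max": bias,         # E = 2^r - 2 is the largest normalized characteristic
    }

for name, r, p in [("single", 8, 23), ("double", 11, 52), ("quadruple", 15, 112)]:
    print(name, format_parameters(r, p))
```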
The specified formats result in the following limits of the respective number ranges. The numbers of smallest magnitude are not normalized. With $\varepsilon = 2^{-(p+1)}$, the relative distance between two adjacent floating point numbers is greater than $\varepsilon$ and less than or equal to $2\varepsilon$. Specifically, the distance (and in this case also the relative distance) between the floating point number 1 and the next larger floating point number is $2\varepsilon$. "Decimal places" gives the number of digits of a decimal number that can be stored without loss of accuracy. Mathematically, the mantissa is one bit longer than what is stored because of the implicit bit.
Type  ε  Decimal places  Smallest number in magnitude (normalized)  Smallest number in magnitude (denormalized)  Largest number

single  2^{−(23+1)} ≈ 6.0 · 10^{−8}  7…8  2^{−126} ≈ 1.1 · 10^{−38}  2^{−23} × 2^{−126} ≈ 1.4 · 10^{−45}  (2 − 2^{−23}) × 2^{127} ≈ 3.4 · 10^{38}
single extended, minimum  2^{−(31+1)} ≈ 2.3 · 10^{−10}  9…10  2^{−1022} ≈ 2.2 · 10^{−308}  2^{−31} × 2^{−1022} ≈ 1.0 · 10^{−317}  (2 − 2^{−31}) × 2^{1023} ≈ 1.8 · 10^{308}
double  2^{−(52+1)} ≈ 1.1 · 10^{−16}  15…16  2^{−1022} ≈ 2.2 · 10^{−308}  2^{−52} × 2^{−1022} ≈ 4.9 · 10^{−324}  (2 − 2^{−52}) × 2^{1023} ≈ 1.8 · 10^{308}
double extended, minimum  2^{−(63+1)} ≈ 5.4 · 10^{−20}  19…20  2^{−16382} ≈ 3.4 · 10^{−4932}  2^{−63} × 2^{−16382} ≈ 3.6 · 10^{−4951}  (2 − 2^{−63}) × 2^{16383} ≈ 1.2 · 10^{4932}
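The limits in this table can be recomputed exactly from r and p. The following sketch uses exact rational arithmetic and, as a plausibility check, compares the double row with the constants Python exposes in `sys.float_info`; the function name `limits` and the assumption that Python's `float` is an IEEE 754 double are not part of the standard text.

```python
import sys
from fractions import Fraction

def limits(r, p):
    """Exact range limits of a format with r exponent bits and p mantissa bits."""
    bias = 2 ** (r - 1) - 1
    two = Fraction(2)
    return {
        "epsilon": two ** -(p + 1),
        "min_normal": two ** (1 - bias),
        "min_denormalized": two ** (1 - bias - p),
        "max": (2 - two ** -p) * two ** bias,
    }

single = limits(8, 23)
double = limits(11, 52)

print(float(single["epsilon"]), float(single["max"]))
print(double["min_normal"] == Fraction(sys.float_info.min))
print(double["max"] == Fraction(sys.float_info.max))
```

Note that `sys.float_info.epsilon` is the distance between 1 and the next larger double, which corresponds to $2\varepsilon$ in the notation used here.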
The adjacent figure shows the arrangement of the bits of a single. The specific arrangement of the bits in the memory of a computer system can differ from this figure and depends on the byte order (little/big endian) and other characteristics of the computer.
The arrangement sign, exponent, mantissa in exactly this order puts the represented floating point values (within one sign range) in the same order as the integer values represented by the same bit patterns. This allows the same operations to be used for comparing floating point numbers as for comparing integers. In short: the floating point numbers can be sorted lexicographically.
It should be noted, however, that for increasing negative integer values the corresponding floating point value tends to minus infinity, i.e. the sorting is reversed.
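The ordering property can be checked with a few sample values. In the sketch below the bit patterns are read as unsigned 32-bit integers; the helper `ordered_key`, which flips bits so that a single integer comparison orders all non-NaN values, is a commonly used trick added here for illustration and is not prescribed by the standard.

```python
import struct

def single_bits(x):
    """Return the single precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

positives = [0.0, 1e-40, 1.5, 2.0, 1e10, float("inf")]      # increasing values
negatives = [-0.0, -1e-40, -1.5, -2.0, -1e10, float("-inf")]  # decreasing values

positive_patterns = [single_bits(x) for x in positives]     # increasing integers
negative_patterns = [single_bits(x) for x in negatives]     # also increasing: order reversed

def ordered_key(x):
    """Map a non-NaN float to an integer whose order matches the numeric order."""
    bits = single_bits(x)
    # Negative values: invert all bits so that larger magnitudes sort lower.
    # Non-negative values: set the top bit so that they sort above all negatives.
    return bits ^ 0xFFFFFFFF if bits >> 31 else bits | 0x80000000

mixed = [2.0, -1.5, float("inf"), -1e10, 1e-40, float("-inf"), 0.0]
print(sorted(mixed, key=ordered_key))
```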
Examples
Calculation of decimal number → IEEE754 floating point number
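A decimal number $x$ is to be converted into a floating point number in the IEEE single format.

Conversion of the decimal number into an unsigned binary fixed-point number
The integer part and the fractional part of $|x|$ are each written in binary.

Normalizing and determining the exponent
The highest power of two is factored out: $|x| = m \cdot 2^{e}$ with $1 \le m < 2$. The bias value for the exponent consists of one zero followed by $r - 1$ ones; for single ($r = 8$) this gives $B = 01111111_2 = 127$. The exponent of the power of two is therefore stored with the bias added: $E = e + B$. The normalization can also be achieved by shifting the binary point of the fixed-point number until exactly one 1 remains in front of it; the number of places shifted is $e$. The digits after the binary point form the mantissa bits $M$.

Determine the sign bit
The number in this example is positive, so $S = 0$.

Form the floating point number
The fields $S$, $E$ and $M$ are concatenated in this order. The leading one of the mantissa is omitted as a hidden bit.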
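The conversion steps can be sketched in Python. The value 18.4 is an assumed example input chosen for illustration; `encode_single` is a hypothetical helper that handles only nonzero finite values in the normalized single range and rounds to nearest, ties to even.

```python
import struct
from fractions import Fraction

def encode_single(x):
    """Return (S, E, M) for a nonzero value in the normalized single range."""
    p, bias = 23, 127
    value = Fraction(x)
    sign = 1 if value < 0 else 0
    value = abs(value)
    # Normalize: find e such that 1 <= value / 2^e < 2.
    e = 0
    while value >= 2:
        value /= 2
        e += 1
    while value < 1:
        value *= 2
        e -= 1
    # Drop the hidden 1 and round the fraction to p bits (ties to even).
    mantissa = round((value - 1) * 2 ** p)
    if mantissa == 2 ** p:           # rounding carried over into the next exponent
        mantissa, e = 0, e + 1
    return sign, e + bias, mantissa

S, E, M = encode_single("18.4")      # 18.4 = 10010.0110 0110 ... in binary, so e = 4
pattern = f"{S:01b} {E:08b} {M:023b}"
bits = (S << 31) | (E << 23) | M
print(pattern, hex(bits))

# Cross-check against the conversion performed by the struct module.
(reference,) = struct.unpack(">I", struct.pack(">f", 18.4))
print(bits == reference)
```

With this input the characteristic is 4 + 127 = 131, and the leading 1 of the normalized mantissa does not appear in the 23 stored mantissa bits.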
Calculation of IEEE754 floating point number → decimal number
Now the floating point number from above is to be converted back into a decimal number, so its IEEE 754 bit pattern is given.

Calculating the exponent
The exponent field is converted to a decimal number $E$. Since the stored value is the exponent plus the bias, the bias is subtracted: $e = E - B$. In this example $e = 4$.

Calculating the mantissa
Since this is a normalized number, it has a 1 in front of the binary point: $m = 1.M$. The binary point then has to be shifted 4 places to the right.

Conversion to decimal
The digits before the binary point are converted like an ordinary binary integer. To obtain the value of the digits after the binary point, the same procedure is applied as for integers, but in the opposite direction, i.e. from left to right: the exponents of the powers of two are negative and start at −1.

Determine the sign
The sign bit is a zero, so the number is positive.

Combine components to a decimal number
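The reverse conversion can be sketched in the same way. The bit pattern below is an assumed example input (the single precision encoding of 18.4), and the slicing of the digit string works only for exponents between 0 and 23.

```python
import struct
from fractions import Fraction

pattern = "0" "10000011" "00100110011001100110011"   # S, E, M of the assumed example

sign_bit = pattern[0]
exponent_bits = pattern[1:9]
mantissa_bits = pattern[9:]

e = int(exponent_bits, 2) - 127          # subtract the bias: 131 - 127 = 4

digits = "1" + mantissa_bits             # restore the hidden bit: 1.M
integer_digits = digits[: 1 + e]         # binary point moved e places to the right
fraction_digits = digits[1 + e:]

integer_part = int(integer_digits, 2)
# Fractional digits carry the weights 2^-1, 2^-2, ... from left to right.
fraction_part = sum(
    Fraction(int(d), 2 ** k) for k, d in enumerate(fraction_digits, start=1)
)

value = (-1) ** int(sign_bit) * (integer_part + fraction_part)
print(value, float(value))

# Cross-check against the struct module.
(reference,) = struct.unpack(">f", bytes.fromhex("41933333"))
print(float(value) == reference)
```

The result is slightly smaller than 18.4 because 18.4 has no finite binary expansion and was rounded to 23 mantissa bits when it was encoded.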
Interpretation of the number format
The interpretation depends on the exponent. In the following, S is the value of the sign bit (0 or 1), E is the value of the exponent field as a nonnegative integer between 0 and E_{max} = 11…111 = 2^{r} − 1, M is the value of the mantissa field as a nonnegative integer, and B is the bias value. The numbers r and p denote the number of exponent bits and mantissa bits.
Characteristic  Mantissa M  Meaning  Informally  Designation

E = 0  M = 0  (−1)^{S} × 0  ±0  zero (belongs to the denormalized numbers)
E = 0  M > 0  (−1)^{S} × M / 2^{p} × 2^{1−B}  ±0.M × 2^{1−B}  denormalized number
0 < E < 2^{r} − 1  M ≥ 0  (−1)^{S} × (1 + M / 2^{p}) × 2^{E−B}  ±1.M × 2^{E−B}  normalized number
E = 2^{r} − 1  M = 0  infinity  ±∞  infinity
E = 2^{r} − 1  M > 0  not a number  NaN  not a number (NaN)
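The case distinction of this table translates directly into a small function. The following is a sketch with a hypothetical helper `interpret`; it returns the designation together with the exact value as a `Fraction`, uses +1 or −1 as a stand-in for ±∞, and does not distinguish +0 from −0 because a `Fraction` has no signed zero.

```python
from fractions import Fraction

def interpret(S, E, M, r, p):
    """Return (designation, value) for the bit fields S, E, M of an r/p format."""
    bias = 2 ** (r - 1) - 1
    sign = (-1) ** S
    if E == 2 ** r - 1:
        # Maximum characteristic: infinity (value holds only the sign) or NaN.
        return ("infinity", sign) if M == 0 else ("nan", None)
    if E == 0:
        # Zero and denormalized numbers: no hidden bit, fixed exponent 1 - B.
        designation = "zero" if M == 0 else "denormalized"
        return designation, sign * Fraction(M, 2 ** p) * Fraction(2) ** (1 - bias)
    # Normalized numbers: hidden bit 1, exponent E - B.
    return "normalized", sign * (1 + Fraction(M, 2 ** p)) * Fraction(2) ** (E - bias)

print(interpret(0, 127, 0, 8, 23))     # single precision 1.0
print(interpret(1, 0, 1, 8, 23))       # negative single of smallest magnitude
print(interpret(0, 255, 0, 8, 23))     # +infinity
```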
Zero
Zero is represented as a signed zero. Numbers that are too small in magnitude to be represented (underflow) are rounded to zero, and their sign is retained: small negative numbers are rounded to −0.0, small positive numbers to +0.0. In a direct comparison, however, +0.0 and −0.0 are considered equal.
Normalized number
The mantissa consists of the first n significant digits of the binary representation of the not yet normalized number. The first significant digit is the most significant (i.e. leftmost) digit other than 0. Since a digit other than 0 can only be a 1 in the binary system, this first 1 does not have to be stored explicitly; according to the IEEE 754 standard only the following digits are stored, and the first digit is an implicit digit or implicit bit (hidden bit). This saves 1 bit of storage space.
Denormalized number
If a number is too small to be stored in normalized form with the smallest nonzero exponent, it is stored as a "denormalized number". Its interpretation is no longer ±1.mantissa · 2^{exponent} but ±0.mantissa · 2^{de}, where de is the value of the smallest "normal" exponent. This fills the gap between the smallest normalized number and zero. However, denormalized numbers have a lower (relative) accuracy than normalized numbers; the number of significant digits in the mantissa decreases towards zero.
If the result (or intermediate result) of a calculation is smaller in magnitude than the smallest representable number of the finite arithmetic in use, it is generally rounded to zero; this is called underflow of the floating point arithmetic. Since information is lost in the process, underflow is avoided where possible. The denormalized numbers in IEEE 754 provide gradual underflow: around 0, 2^{24} (for single) and 2^{53} (for double) values are inserted, all at the same absolute distance from each other, which could not be represented without denormalized numbers and would have to underflow.
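Gradual underflow can be observed with ordinary Python floats, assuming they are IEEE 754 doubles and that the processor is not running in a flush-to-zero mode.

```python
import sys

smallest_normal = sys.float_info.min          # 2^-1022 for double
halved = smallest_normal / 2                  # denormalized, not zero
smallest_denormalized = 5e-324                # 2^-1074

# Denormalized numbers are equally spaced: each step is 2^-1074.
step = 3 * smallest_denormalized - 2 * smallest_denormalized

# The difference of two distinct nearby numbers stays nonzero,
# although it is only representable as a denormalized number.
x, y = 1.5 * smallest_normal, smallest_normal
difference = x - y

# Only below the smallest denormalized number does the result round to zero.
underflowed = smallest_denormalized / 2

print(halved, step, difference, underflowed)
```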
On the processor side, denormalized numbers are implemented with low priority because they occur comparatively rarely, and therefore cause a significant slowdown as soon as they appear as an operand or as the result of a calculation. To remedy this (e.g. for computer games), Intel has offered since SSE2 the option, not compliant with IEEE 754, of disabling denormalized numbers completely (MXCSR options "flush to zero" and "denormals are zero"). Floating point numbers that fall into this range are then rounded to 0.
Infinity
The floating point value infinity represents numbers whose magnitude is too large to be represented. A distinction is made between positive infinity and negative infinity. According to the definition of IEEE 754, the calculation of 1.0/0.0 results in positive infinity.
Not a number (NaN)
NaN represents invalid (or undefined) results, e.g. when attempting to calculate the square root of a negative number. Some indeterminate expressions yield "not a number", for example 0.0/0.0 or ∞ − ∞. In addition, NaNs are used in various application areas to represent "no value" or "unknown value". In particular, the value with the bit pattern 111…111 is often used for an "uninitialized floating point number".
IEEE 754 requires two kinds of NaN: quiet NaN (NaNq) and signaling NaN (NaNs). Both explicitly do not represent numbers. In contrast to a quiet NaN, a signaling NaN triggers an exception (trap) when it occurs as an operand of an arithmetic operation.
IEEE 754 allows the user to disable these traps. In this case, signaling NaNs are treated like quiet NaNs.
Signaling NaNs can be used to fill uninitialized computer memory, so that every use of an uninitialized variable automatically triggers an exception.
Quiet NaNs make it possible to handle calculations that cannot produce a result, for example because they are not defined for the given operands. Examples are the division of zero by zero or the logarithm of a negative number.
Quiet and signaling NaNs differ in the highest mantissa bit: it is 1 for a quiet NaN and 0 for a signaling NaN. The remaining mantissa bits can carry additional information, e.g. the cause of the NaN, which can be useful for exception handling. However, the standard does not specify what information the remaining mantissa bits contain, so their evaluation is platform-dependent.
The sign bit has no meaning with NaN. It is not specified which value the sign bit has for the returned NaN.
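The bit fields of a NaN can be inspected with the `struct` module. This sketch assumes double precision and the convention described above (highest mantissa bit set for a quiet NaN); `nan_fields` is a hypothetical helper, and the payload value `0x2A` is arbitrary.

```python
import math
import struct

def nan_fields(x):
    """Return (sign bit, quiet bit, payload) of a double precision NaN."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & (2 ** 52 - 1)
    if exponent != 0x7FF or mantissa == 0:
        raise ValueError("not a NaN")
    quiet_bit = mantissa >> 51                 # highest mantissa bit
    payload = mantissa & (2 ** 51 - 1)         # remaining bits, meaning is platform-dependent
    return bits >> 63, quiet_bit, payload

# Build a quiet NaN that carries an arbitrary example payload.
custom_bits = (0x7FF << 52) | (1 << 51) | 0x2A
(custom_nan,) = struct.unpack(">d", struct.pack(">Q", custom_bits))

print(math.isnan(custom_nan), nan_fields(custom_nan))
print(nan_fields(float("nan")))                # result depends on the platform
```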
Rounding
IEEE 754 distinguishes between binary rounding and binary-decimal rounding, for which lower quality requirements apply.
Binary rounding must round to the nearest representable number. If this is not unique (the value lies exactly halfway between two representable numbers), the result is rounded so that the least significant bit of the mantissa becomes 0. Statistically, this rounds up in 50% of the cases and down in the other 50%, which avoids the statistical drift in longer calculations described by Knuth.
An implementation conforming to IEEE 754 must provide three further rounding modes that can be selected by the programmer: rounding toward +infinity (always round up), rounding toward −infinity (always round down) and rounding toward 0 (always reduce the magnitude).
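Python does not expose the rounding mode of binary floating point hardware, so the four modes are emulated here on exact rationals. This is only a sketch of the rounding step itself: `round_mantissa` and the mode names are illustrative assumptions, and exponent adjustment after a carry is left out.

```python
import math
from fractions import Fraction

ROUNDERS = {
    "nearest_even": round,            # Fraction ties go to the even neighbour
    "toward_positive": math.ceil,
    "toward_negative": math.floor,
    "toward_zero": math.trunc,
}

def round_mantissa(m, p, mode):
    """Round an exact mantissa value m to p fractional bits under the given mode."""
    scaled = Fraction(m) * 2 ** p     # exact value in units of the last mantissa bit
    return Fraction(ROUNDERS[mode](scaled), 2 ** p)

# 1 + 2^-24 lies exactly halfway between two neighbouring single precision mantissas.
tie = 1 + Fraction(1, 2 ** 24)
results = {mode: round_mantissa(tie, 23, mode) for mode in ROUNDERS}
for mode, rounded in results.items():
    print(mode, rounded)
```

In the tie case only rounding toward +infinity moves to the larger neighbour; round to nearest picks the neighbour whose last mantissa bit is 0.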
Operations
IEEE 754 compliant implementations must provide operations for arithmetic, calculation of the square root, conversions, and comparisons. Another group of operations is recommended in the appendix, but not compulsory.
Arithmetic and square root
IEEE 754 requires exactly rounded results from a (hardware or software) implementation for the operations addition, subtraction, multiplication and division of two operands as well as the operation square root of an operand. This means that the result determined must be the same as that which arises from an exact execution of the corresponding operation with subsequent rounding.
The remainder after a division with an integer quotient must also be provided. This remainder is defined by $r = x - n \cdot y$, where $n$ is an integer with $|n - x/y| < 1/2$, or with $|n - x/y| = 1/2$ if $n$ is even. This remainder must be determined exactly, without rounding.
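Python's `math.remainder` (available since Python 3.7) follows the IEEE 754 style definition $r = x - n \cdot y$ with $n$ the integer nearest to $x/y$ and ties resolved to the even $n$. A few sample values show how it differs from the usual modulo operation.

```python
import math

examples = {
    (10.0, 3.0): math.remainder(10.0, 3.0),   # x/y ≈ 3.33, n = 3, r = 1.0
    (11.0, 3.0): math.remainder(11.0, 3.0),   # x/y ≈ 3.67, n = 4, r = -1.0
    (5.0, 2.0): math.remainder(5.0, 2.0),     # x/y = 2.5 is a tie, even n = 2, r = 1.0
    (7.0, 2.0): math.remainder(7.0, 2.0),     # x/y = 3.5 is a tie, even n = 4, r = -1.0
}

for (x, y), r in examples.items():
    print(f"remainder({x}, {y}) = {r}, x % y = {x % y}")
```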
Conversions
Conversions are required between all supported floating point formats. When converting to a floating point format with less precision, the result must be correctly rounded as described under arithmetic.
IEEE 754 compliant implementations must provide conversions between all supported floating point formats and all supported integer formats. The IEEE 754 does not define the integer formats in more detail.
For every supported floating point format, there must be an operation that converts this floating point number into the exactly rounded whole number in the same floating point format.
Finally, there must be conversions between the binary floating point format and a decimal format that meet precisely described minimum quality requirements.
Comparisons
Floating point numbers according to IEEE 754 must be comparable. The standard defines the required comparison operations and the required results for all possible special cases (especially NaN, infinity and 0). In addition to the results of the "school mathematics" comparisons (less than, equal to or greater than), IEEE 754 has a further possible result, unordered, which applies when one of the operands is NaN. Two NaNs always compare as unequal, even if their bit patterns match.
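The four possible comparison results can be made explicit in a small helper. The function names `unordered` and `compare` are illustrative; the sketch relies on Python's built-in float comparisons, which return False for every ordered comparison involving NaN.

```python
import math

nan = float("nan")

def unordered(x, y):
    """True if x and y cannot be ordered, i.e. at least one of them is NaN."""
    return math.isnan(x) or math.isnan(y)

def compare(x, y):
    """Return exactly one of the four relations between x and y."""
    if unordered(x, y):
        return "unordered"
    if x < y:
        return "less"
    if x > y:
        return "greater"
    return "equal"

print(compare(nan, nan))                 # NaN is unordered even against itself
print(compare(0.0, -0.0))                # the two zeros compare as equal
print(compare(float("-inf"), 1.0))
print(nan == nan, nan != nan, nan < 1.0)
```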
Recommended operations
Ten additional operations are recommended in the appendix to the standard. Since an implementation basically needs them anyway, the recommendation ultimately amounts to making these operations available to the programmer. The operations are (in C notation): copysign(x, y), invertsign(x), scalb(y, n), logb(x), nextafter(x, y), finite(x), isnan(x), x ≠ y, unordered(x, y), class(x). The details of their behavior, again especially for special cases such as NaN, are also suggested.
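Several of these operations have rough counterparts in Python's `math` module. The mapping below is an assumption made for illustration rather than an official correspondence: `math.ldexp` stands in for scalb, the exponent returned by `math.frexp` minus one stands in for logb of a finite nonzero normalized number, and `math.nextafter` requires Python 3.9 or later.

```python
import math

results = {
    "copysign": math.copysign(3.0, -0.0),       # -3.0: the sign of -0.0 is used
    "scalb": math.ldexp(1.5, 4),                # 1.5 * 2^4 = 24.0
    "logb": math.frexp(24.0)[1] - 1,            # unbiased exponent of 24.0 = 1.5 * 2^4
    "nextafter": math.nextafter(1.0, 2.0),      # next double above 1.0
    "finite": math.isfinite(float("inf")),      # False
    "isnan": math.isnan(float("nan")),          # True
}

for name, value in results.items():
    print(name, value)
```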
Exceptions, flags and traps
If exceptions occur during the calculation, status flags are set. The standard stipulates that the user can read and write these flags. The flags are "sticky": once they are set, they are retained until they are explicitly reset. For example, checking the flags is the only way to distinguish 1/0 (= infinity) from an overflow.
It is also recommended in the standard to enable trap handlers: If an exception occurs, the trap handler is called instead of setting the status flag. It is the responsibility of such trap handlers to set or delete the corresponding status flag.
The standard divides exceptions into five classes: overflow, underflow, division by zero, invalid operation and inexact. A status flag is available for each class.
History
In the 1960s and early 1970s, each processor had its own format for floating point numbers and its own FPU or floating point software to process that format. The same program could produce different results on different computers. The quality of the various floating point arithmetics also varied widely.
Around 1976, Intel planned its own FPU for its microprocessors and wanted the best possible solution for the arithmetic to be implemented. In 1977, under the auspices of the IEEE, meetings began to standardize floating point arithmetic for microprocessors. The second meeting took place in November 1977 in San Francisco, chaired by Richard Delp. One of the leading participants was William Kahan.
Around 1980, the number of proposals for the standard was reduced to two: the KCS proposal (named after its authors Kahan, Coonen and Stone, 1977) ultimately prevailed over the alternative from DEC (F format, D format and G format). A major milestone on the way to the standard was the discussion of how to handle underflow, which had been neglected by most programmers until then.
In parallel with the development of the standard, Intel largely implemented the proposals in the Intel 8087 FPU, which was used as a floating point coprocessor for the 8088. The first version of the standard was adopted in 1985 and expanded in 2008.
Literature
 IEEE 754: reprinted in SIGPLAN Notices, Vol. 22, No. 2, Feb. 1987, pp. 9–25
 Jean-Michel Muller: Elementary Functions: Algorithms and Implementation. 2nd edition. Birkhäuser, Lyon 2006, ISBN 0-8176-4372-9.
Web links
 IEEE 754-1985 (PDF, 89 KiB)
 IEEE Std 754-2019 of July 22, 2019
 IEEE Std 754-2008 (PDF). Archived from the original on November 6, 2016; accessed on May 30, 2017.
 Online converter between binary and decimal representation of IEEE 754 floating point numbers
 Java applet for converting decimal → IEEE 754 and IEEE 754 → decimal numbers, with explanations, for Windows Internet Explorer
 The story: An Interview with the Old Man of Floating-Point (Reminiscences elicited from William Kahan by Charles Severance)
 William Kahan: Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, 1996
 David Goldberg: What Every Computer Scientist Should Know About Floating-Point Arithmetic
References
 IEEE Standard for Floating-Point Arithmetic. In: IEEE Std 754-2019 (Revision of IEEE 754-2008). July 2019, pp. 1–84, doi:10.1109/IEEESTD.2019.8766229 (ieee.org [accessed February 5, 2020]).
 IEEE 754-2008: Standard for Floating-Point Arithmetic, IEEE Standards Association, 2008, doi:10.1109/IEEESTD.2008.4610935
 David Goldberg: What Every Computer Scientist Should Know About Floating-Point Arithmetic. In: ACM Computing Surveys. 23, 1991, pp. 5–48. doi:10.1145/103162.103163. Retrieved September 2, 2010.
 Shawn Casey: x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ). October 16, 2008. Retrieved September 3, 2010.