IEEE 754

The IEEE 754 standard (ANSI/IEEE Std 754-1985; international version: IEC 60559:1989) defines standard representations for binary floating-point numbers in computers and specifies precise procedures for performing mathematical operations, in particular for rounding. The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic for microprocessor systems (ANSI/IEEE Std 754-1985).

The current edition was published in July 2019 under the name ANSI/IEEE Std 754-2019. The IEEE 854-1987 standard, titled Standard for Radix-Independent Floating-Point Arithmetic, was fully integrated into IEEE 754-2008.

Overview

The IEEE 754-1985 standard defines two basic data formats for binary floating-point numbers, requiring 32 bits (single precision) or 64 bits (double precision) of memory, and two extended formats. IEEE 754-2008 includes the binary formats with 16 bits (as a minifloat), 32 bits (single), 64 bits (double) and, newly, 128 bits. In addition, decimal representations with 32 bits (as a minifloat), 64 bits and 128 bits were added.

Finally, there have been proposals and implementations of other number formats that are designed according to the principles of the IEEE 754-1985 standard and are therefore often referred to as IEEE numbers, although strictly speaking they are not covered by the old definition. These include the minifloats integrated in the newer editions, which are intended for teaching. 16-bit minifloats are occasionally used in graphics programming. There are also several number formats with more than 64 bits that IEEE 754-1985 does not define, such as the 80-bit format (Extended Precision Layout ...), which IA-32 processors use internally in their classic floating-point unit (FPU).

General

The representation of a floating-point number

$x = s \cdot m \cdot b^{e}$

consists of:

• Sign $s$ (1 bit)
• Mantissa $m$ ($p$ bits)
• Base $b$ (for normalized floating-point numbers according to IEEE 754, $b = 2$)
• Exponent $e$ ($r$ bits)

The sign $s = (-1)^{S}$ is stored in one bit $S$, so that $S = 0$ marks positive numbers and $S = 1$ negative numbers.

The exponent $e$ is stored as a non-negative binary number $E$ (sometimes also called the characteristic or biased exponent) by adding the fixed bias value $B$: $E = e + B$. The bias value is calculated as $B = 2^{r-1} - 1$. The bias is used so that negative exponents can be stored as an unsigned number (the characteristic $E$) without resorting to alternative encodings such as two's complement (compare also excess code).

Finally, the mantissa $1 \leq m < 2$ is a value calculated from the $p$ mantissa bits with the value $M$ as $m = 1 + M / 2^{p}$. Put more simply, one imagines a "1." prepended to the left of the mantissa bit pattern $M$: $m = 1.M$.

• $s = (-1)^{S}$
• $e = E - B$
• $m = 1.M = 1 + M / 2^{p}$

This procedure is possible because the condition $1 \leq m < 2$ can always be met for all representable numbers by normalization (see below). Since the mantissa then always starts with "1" on the left, this bit no longer needs to be stored. This gives one additional bit of precision.

Two exponent values with special bit patterns are reserved for special cases: the maximum value ($E = 11\dots111_{2} = 2^{r} - 1$) and the minimum value ($E = 00\dots000_{2} = 0$). The special cases NaN and ∞ are coded with the maximum exponent value. The floating-point number 0 and all denormalized values are coded with zero in the exponent.

Values outside the normal range (numbers that are too large or too negative) are represented by ∞ or −∞. This extension of the range of values often allows useful further calculation even after an arithmetic overflow. In addition to the number 0, there is also the value −0. While $\tfrac{1}{0}$ gives ∞, $\tfrac{1}{-0}$ gives −∞. In comparisons, no distinction is made between 0 and −0.

The values NaN ("not a number") are used to represent undefined values. They occur, for example, as results of operations such as $\tfrac{0}{0}$ or $\infty - \infty$. NaNs are divided into signaling NaNs (NaNs) for exceptional conditions and quiet NaNs (NaNq).

As a last special case, denormalized numbers (called subnormal numbers in IEEE 754r) fill the region between zero and the normalized floating-point number that is smallest in magnitude. They are stored as fixed-point numbers and do not have the same precision as the normalized numbers. By design, the reciprocals of the smallest of these values overflow to ∞.
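The relations $s = (-1)^{S}$, $e = E - B$ and $m = 1 + M / 2^{p}$ can be tried out directly. The following is a minimal sketch, assuming Python's standard `struct` module and the single format ($r = 8$, $p = 23$); the function name `decode_single` is purely illustrative, and the reconstruction covers normalized numbers only.

```python
import struct

def decode_single(x):
    """Split x, stored as an IEEE 754 single, into (S, E, M) and rebuild its value."""
    r, p = 8, 23                        # exponent bits and mantissa bits of single
    B = 2 ** (r - 1) - 1                # bias value, 127 for single
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    S = bits >> (r + p)                 # sign bit
    E = (bits >> p) & (2 ** r - 1)      # characteristic (biased exponent)
    M = bits & (2 ** p - 1)             # stored mantissa bits, hidden bit not included
    # Normalized numbers only: x = (-1)^S * (1 + M / 2^p) * 2^(E - B)
    value = (-1) ** S * (1 + M / 2 ** p) * 2.0 ** (E - B)
    return S, E, M, value

S, E, M, value = decode_single(18.4)
print(S, E, format(M, "023b"), value)
```

For the input 18.4 the sketch yields $S = 0$ and $E = 131$; the rebuilt value differs slightly from 18.4 because only 23 mantissa bits are stored.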

Number formats and other specifications of the IEEE 754 standard

IEEE 754 distinguishes between four representations: single precision (single), single extended precision (single extended), double precision (double) and double extended precision (double extended). For the extended formats, only a minimum number of bits is prescribed; the exact number of bits and the bias value are left to the implementer. The basic formats are fully defined.

The number of exponent bits primarily determines the maximum and minimum of the representable numbers. The number of mantissa bits determines the (relative, see below) accuracy of these numbers, and only to a small extent the maximum and minimum.

| Type | Size (1 + r + p) | Exponent (r) | Mantissa (p) | Values of the exponent (e) | Bias value (B) |
|---|---|---|---|---|---|
| single | 32 bit | 8 bit | 23 bit | −126 ≤ e ≤ 127 | 127 |
| single extended | ≥ 43 bit | ≥ 11 bit | ≥ 31 bit | e_min ≤ −1022, e_max ≥ 1023 | not specified |
| double | 64 bit | 11 bit | 52 bit | −1022 ≤ e ≤ 1023 | 1023 |
| double extended | ≥ 79 bit | ≥ 15 bit | ≥ 63 bit | e_min ≤ −16382, e_max ≥ 16383 | not specified |
| quadruple | 128 bit | 15 bit | 112 bit | −16382 ≤ e ≤ 16383 | 16383 |

The rows for the two extended formats each show a minimal extended format.
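The bias values and exponent ranges of the fully defined formats follow from the number of exponent bits $r$ alone. The sketch below assumes only the rules stated in this article ($B = 2^{r-1} - 1$, with the characteristics $E = 0$ and $E = 2^{r} - 1$ reserved); the function name `exponent_parameters` is illustrative.

```python
def exponent_parameters(r):
    """Return (bias, e_min, e_max) for an exponent field of r bits."""
    bias = 2 ** (r - 1) - 1
    reserved_max = 2 ** r - 1            # all ones: reserved for infinity and NaN
    e_min = 1 - bias                     # smallest normal characteristic is E = 1
    e_max = (reserved_max - 1) - bias    # largest normal characteristic
    return bias, e_min, e_max

for name, r in [("single", 8), ("double", 11), ("quadruple", 15)]:
    print(name, exponent_parameters(r))
```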

The specified formats result in the following limits of the respective number ranges. The smallest numbers in magnitude are not normalized. The relative distance between two adjacent floating-point numbers is greater than $\epsilon$ and less than or equal to $2\epsilon$. The distance (and in this case also the relative distance) between the floating-point number $1$ and the next larger floating-point number is $2\epsilon$. Decimal places gives the number of digits of a decimal number that can be stored without loss of accuracy. Because of the implicit bit, the mantissa is mathematically one bit longer than what is stored.

| Type | $\epsilon$ | Decimal places | Smallest number in magnitude (normalized) | Smallest number in magnitude (denormalized) | Largest number |
|---|---|---|---|---|---|
| single | $2^{-(23+1)} \approx 6.0 \cdot 10^{-8}$ | 7…8 | $2^{-126} \approx 1.2 \cdot 10^{-38}$ | $2^{-23} \times 2^{-126} \approx 1.4 \cdot 10^{-45}$ | $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \cdot 10^{38}$ |
| single extended, minimum | $2^{-(31+1)} \approx 2.3 \cdot 10^{-10}$ | 9…10 | $2^{-1022} \approx 2.2 \cdot 10^{-308}$ | $2^{-31} \times 2^{-1022} \approx 1.0 \cdot 10^{-317}$ | $(2 - 2^{-31}) \times 2^{1023} \approx 1.8 \cdot 10^{308}$ |
| double | $2^{-(52+1)} \approx 1.1 \cdot 10^{-16}$ | 15…16 | $2^{-1022} \approx 2.2 \cdot 10^{-308}$ | $2^{-52} \times 2^{-1022} \approx 4.9 \cdot 10^{-324}$ | $(2 - 2^{-52}) \times 2^{1023} \approx 1.8 \cdot 10^{308}$ |
| double extended, minimum | $2^{-(63+1)} \approx 5.4 \cdot 10^{-20}$ | 19…20 | $2^{-16382} \approx 3.4 \cdot 10^{-4932}$ | $2^{-63} \times 2^{-16382} \approx 3.6 \cdot 10^{-4951}$ | $(2 - 2^{-63}) \times 2^{16383} \approx 1.2 \cdot 10^{4932}$ |
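The limits listed for each format can be derived from the number of exponent bits $r$ and mantissa bits $p$. The following sketch uses exact rational arithmetic from Python's `fractions` module so that nothing is rounded along the way; the function name `format_limits` is illustrative, and the comparison with `sys.float_info` assumes that Python's `float` is an IEEE 754 double, which holds on common platforms.

```python
from fractions import Fraction
import sys

def format_limits(r, p):
    """Return (epsilon, smallest normalized, smallest denormalized, largest) exactly."""
    B = 2 ** (r - 1) - 1
    two = Fraction(2)
    epsilon = two ** -(p + 1)
    smallest_normalized = two ** (1 - B)
    smallest_denormalized = two ** -p * smallest_normalized
    largest = (2 - two ** -p) * two ** B
    return epsilon, smallest_normalized, smallest_denormalized, largest

# Print the limits of single as ordinary floats for readability.
print([float(v) for v in format_limits(8, 23)])

# The limits of double agree with what the Python runtime reports.
eps, min_norm, min_denorm, largest = format_limits(11, 52)
print(min_norm == Fraction(sys.float_info.min), largest == Fraction(sys.float_info.max))
```

Note that `sys.float_info.epsilon` is the distance between 1 and the next larger number, i.e. $2\epsilon$ in the notation used here.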

The adjacent figure shows the arrangement of the bits of a single. The actual arrangement of the bits in a computer's memory can differ from this figure and depends on the byte order (little/big endian) and other characteristics of the machine.

The arrangement of sign, exponent and mantissa in exactly this order puts the represented floating-point values (within one sign range) in the same order as the integer values represented by the same bit patterns. The same operations can therefore be used for comparing floating-point numbers as for comparing integers. In short: floating-point numbers can be sorted lexicographically.

Note, however, that for negative numbers an increasing integer value of the bit pattern corresponds to a floating-point value tending toward minus infinity, i.e. the sort order is reversed.
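The ordering property can be checked on a few sample values. The sketch below assumes Python's `struct` module and the single format. The helper `sortable_key` is not part of the standard or of this article; it is a commonly used trick, shown here only as an illustration of how the reversed order of the negative range can be compensated.

```python
import struct

def bit_pattern(x):
    """32-bit pattern of x stored as an IEEE 754 single, read as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def sortable_key(x):
    """Hypothetical helper: integer key whose order matches numeric order for both signs."""
    bits = bit_pattern(x)
    if bits & 0x80000000:            # negative: invert all bits to reverse the order
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000         # non-negative: place above all negative keys

positives = [18.4, 0.0, 3.0e38, 1e-40, float("inf"), 1.5]
negatives = [-x for x in positives]

# Within the positive range, integer order equals numeric order.
print(sorted(positives, key=bit_pattern) == sorted(positives))
# Within the negative range, integer order is the reverse of numeric order.
print(sorted(negatives, key=bit_pattern) == sorted(negatives, reverse=True))
```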

Examples

Calculation of decimal number → IEEE754 floating point number

The number $18.4$ is to be converted into a floating-point number using the IEEE single format.

1. Conversion of the decimal number into an unsigned binary fixed-point number

$\begin{array}{ccc} 18 \div 2 = 9 & \text{remainder 0} & \text{(least significant bit)} \\ 9 \div 2 = 4 & \text{remainder 1} & \\ 4 \div 2 = 2 & \text{remainder 0} & \\ 2 \div 2 = 1 & \text{remainder 0} & \\ 1 \div 2 = 0 & \text{remainder 1} & \text{(most significant bit)} \\ & 18 = 10010_{2} & \end{array}$

$\begin{array}{ccc} 0.4 \cdot 2 = 0.8 & \rightarrow 0 & \text{(most significant bit)} \\ 0.8 \cdot 2 = 1.6 & \rightarrow 1 & \\ 0.6 \cdot 2 = 1.2 & \rightarrow 1 & \\ 0.2 \cdot 2 = 0.4 & \rightarrow 0 & \\ 0.4 \cdot 2 = 0.8 & \rightarrow 0 & \\ 0.8 \cdot 2 = 1.6 & \rightarrow 1 & \text{(least significant bit)} \\ \ldots & & \\ & 0.4 = 0.011001\ldots_{2} & \end{array}$

$18.4 = 10010.011001\ldots_{2} = (1 \cdot 2^{4} + 0 \cdot 2^{3} + 0 \cdot 2^{2} + 1 \cdot 2^{1} + 0 \cdot 2^{0}) + (0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} + 0 \cdot 2^{-4} + 0 \cdot 2^{-5} + 1 \cdot 2^{-6} + \ldots)$

2. Normalizing and determining the exponent

Factoring out the highest power of two:

$(1 \cdot 2^{0} + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} + 0 \cdot 2^{-4} + 0 \cdot 2^{-5} + 1 \cdot 2^{-6} + 1 \cdot 2^{-7} + 0 \cdot 2^{-8} + 0 \cdot 2^{-9} + 1 \cdot 2^{-10} + \ldots) \cdot 2^{4}$

The bias value for the exponent consists of a zero followed by $r - 1$ ones. For $r = 8$: $B = 01111111_{2} = 127_{10}$. The exponent of the power of two $2^{4}$ is therefore stored with the bias $B$ added: $4 + 127 = 131 = 10000011_{2}$.

The normalization of $10010.011001\ldots_{2}$ can also be achieved by shifting the binary point:

$\begin{aligned} & 10010.011001\ldots \cdot 2^{01111111-01111111} \\ = {} & 1001.0011001\ldots \cdot 2^{10000000-01111111} \\ = {} & 100.10011001\ldots \cdot 2^{10000001-01111111} \\ = {} & 10.010011001\ldots \cdot 2^{10000010-01111111} \\ = {} & 1.0010011001\ldots \cdot 2^{10000011-01111111} \end{aligned}$

The mantissa is $1.0010011001\ldots$ and the biased exponent is $10000011$.

3. Determining the sign bit

The number is positive, so the sign bit is $0$.

4. Forming the floating-point number

The leading one of the mantissa is omitted as the hidden bit.

$\begin{array}{ccc} \text{1 bit sign} & \text{8 bit exponent} & \text{23 bit mantissa} \\ 0 & 10000011 & 00100110011001100110011 \end{array}$

Calculation of IEEE754 floating point number → decimal number

Now the floating-point number from above is to be converted back into a decimal number, so the following IEEE 754 number is given:

$0\ 10000011\ 00100110011001100110011$

1. Calculating the exponent

Converting the exponent field into a decimal number: $10000011_{2} = 131_{10}$

Since the stored value is the exponent plus the bias, the bias is subtracted: $131 - 127 = 4$, so 4 is the exponent.

2. Calculating the mantissa

Since this is a normalized number, it has a 1 in front of the binary point: $1.00100110011001100110011$

Now the binary point has to be shifted 4 places to the right: $10010.0110011001100110011$

3. Conversion to decimal

Digits before the point: $10010_{2} = 18_{10}$

Digits after the point: $0.0110011001100110011_{2} \approx 0.39999961853_{10}$

To obtain the value of the fractional part, the same procedure is carried out as for integers, but in the opposite direction, i.e. from left to right. The exponents are negative and start at −1:

$\begin{aligned} & 0 \cdot 2^{-1} \\ + {} & 1 \cdot 2^{-2} \\ + {} & 1 \cdot 2^{-3} \\ + {} & 0 \cdot 2^{-4} \\ + {} & 0 \cdot 2^{-5} \\ + {} & 1 \cdot 2^{-6} \\ & \ldots \end{aligned}$

4. Determining the sign

The sign bit is a zero, so the number is positive.

5. Combining the components into a decimal number

$18.39999961853$

Interpretation of the number format

The interpretation depends on the exponent. In the following, $S$ is the value of the sign bit (0 or 1), $E$ is the value of the exponent field as a non-negative integer between 0 and $E_{\max} = 11\ldots111_{2} = 2^{r} - 1$, $M$ is the value of the mantissa field as a non-negative integer, and $B$ denotes the bias value. The numbers $r$ and $p$ denote the number of exponent bits and mantissa bits.

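| Characteristic | Mantissa M | Meaning | Informally | Designation |
|---|---|---|---|---|
| $E = 0$ | $M = 0$ | $(-1)^{S} \times 0$ | ±0 | Zero (belongs to the denormalized numbers) |
| $E = 0$ | $M > 0$ | $(-1)^{S} \times M / 2^{p} \times 2^{1-B}$ | $\pm 0.M \times 2^{1-B}$ | denormalized number |
| $0 < E < 2^{r} - 1$ | $M \geq 0$ | $(-1)^{S} \times (1 + M / 2^{p}) \times 2^{E-B}$ | $\pm 1.M \times 2^{E-B}$ | normalized number |
| $E = 2^{r} - 1$ | $M = 0$ | infinity | ±∞ | infinity |
| $E = 2^{r} - 1$ | $M > 0$ | not a number | | not a number (NaN) |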
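The case distinction of the table can be applied mechanically to a bit pattern. The following is a minimal sketch for the single format ($r = 8$, $p = 23$); the function name `interpret_single` and the returned labels are illustrative, and the value is computed with Python's `float`, which is assumed to be an IEEE 754 double and can therefore hold every single value exactly.

```python
def interpret_single(bits):
    """Classify a 32-bit pattern and return (designation, value)."""
    r, p = 8, 23
    B = 2 ** (r - 1) - 1
    S = bits >> (r + p)
    E = (bits >> p) & (2 ** r - 1)
    M = bits & (2 ** p - 1)
    sign = -1.0 if S else 1.0
    if E == 0:
        if M == 0:
            return "zero", sign * 0.0
        # Denormalized: no hidden bit, fixed exponent 1 - B
        return "denormalized", sign * (M / 2 ** p) * 2.0 ** (1 - B)
    if E == 2 ** r - 1:
        if M == 0:
            return "infinity", sign * float("inf")
        return "NaN", float("nan")
    # Normalized: hidden bit 1, exponent E - B
    return "normalized", sign * (1 + M / 2 ** p) * 2.0 ** (E - B)

print(interpret_single(0b0_10000011_00100110011001100110011))
print(interpret_single(0x00000001))
print(interpret_single(0xFF800000))
```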

Zero

Zero is a signed zero. Numbers that are too small in magnitude to be represented (underflow) are rounded to zero, and their sign is retained: small negative numbers are rounded to −0.0, small positive numbers to +0.0. In a direct comparison, however, +0.0 and −0.0 are considered equal.

Normalized number

The mantissa consists of the first n significant digits of the binary representation of the not yet normalized number. The first significant digit is the most significant (i.e. leftmost) digit other than 0. Since a digit other than 0 can only be a 1 in the binary system, this first 1 does not have to be stored explicitly; in accordance with the IEEE 754 standard, only the following digits are stored, and the first digit is an implicit digit or implicit bit (hidden bit). This "saves" 1 bit of storage space.

Denormalized number

If a number is too small to be stored in normalized form with the smallest non-zero exponent, it is stored as a "denormalized number". Its interpretation is no longer $\pm 1.\text{mantissa} \cdot 2^{\text{exponent}}$ but $\pm 0.\text{mantissa} \cdot 2^{de}$, where $de$ is the value of the smallest "normal" exponent. This fills the gap between the smallest normalized number and zero. However, denormalized numbers have a lower (relative) accuracy than normalized numbers; the number of significant digits in the mantissa decreases toward zero.

If the result (or an intermediate result) of a calculation is smaller in magnitude than the smallest representable number of the finite arithmetic used, it is generally rounded to zero; this is called underflow of the floating-point arithmetic. Since information is lost in the process, one tries to avoid underflow where possible. The denormalized numbers in IEEE 754 provide gradual underflow by inserting $2^{24}$ (for single) and $2^{53}$ (for double) values "around 0", which all have the same absolute distance from one another and which, without denormalized values, would not be representable but would have to underflow.
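The effect of gradual underflow can be observed with ordinary double arithmetic. The sketch below assumes that Python's `float` is an IEEE 754 double with denormalized numbers enabled, which is the usual case; the variable names are illustrative.

```python
import sys

smallest_normal = sys.float_info.min     # 2^-1022 for double
smallest_denormal = 5e-324               # 2^-1074, the smallest positive double

# Two distinct numbers near the underflow threshold: their difference is
# smaller than any normalized number, but still representable as a denormal.
a = 1.5 * smallest_normal
b = smallest_normal
difference = a - b

print(difference, difference != 0.0)

# Denormals are equally spaced: neighbouring values differ by the same amount.
spacing = 3 * smallest_denormal - 2 * smallest_denormal
print(spacing == smallest_denormal)

# Number of equally spaced steps between 0 and the smallest normalized number.
print(smallest_normal / smallest_denormal == 2 ** 52)
```

Counting both signs, $2 \cdot 2^{52} = 2^{53}$ values lie "around 0" for double, in line with the figure given above.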

On the processor side, denormalized numbers are implemented with low priority because they occur comparatively rarely, and they therefore lead to a significant slowdown in execution as soon as they appear as an operand or as the result of a calculation. As a remedy (e.g. for computer games), Intel has offered, since SSE2, the non-IEEE-754-compliant option of deactivating denormalized numbers completely (MXCSR options "flush to zero" and "denormals are zero"). Floating-point numbers that fall into this range are rounded to 0.

Infinity

The floating-point value infinity represents numbers whose magnitude is too large to be represented. A distinction is made between positive infinity and negative infinity. According to the definition of IEEE 754, the calculation of 1.0/0.0 results in "positive infinity".
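Overflow to infinity can be demonstrated as follows. This is a sketch assuming Python's `float` is an IEEE 754 double. Note that Python itself deviates from the standard's default for division: the expression `1.0 / 0.0` raises `ZeroDivisionError` instead of returning positive infinity, so the example uses overflow in a multiplication.

```python
import math
import sys

largest = sys.float_info.max

overflow_pos = largest * 2.0      # too large in magnitude: becomes +infinity
overflow_neg = -largest * 2.0     # same with negative sign: becomes -infinity

print(overflow_pos, overflow_neg)

# Calculation can continue with infinite operands.
print(overflow_pos + 1.0 == math.inf, 1.0 / overflow_pos == 0.0)
```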

Not a number (NaN)

This represents invalid (or undefined) results, e.g. when trying to calculate the square root of a negative number. Some "indeterminate expressions" yield "not a number", for example 0.0/0.0 or "infinity" − "infinity". In addition, NaNs are used in various application areas to represent "no value" or "unknown value". In particular, the value with the bit pattern 111...111 is often used for an "uninitialized floating-point number".

IEEE 754 requires two types of NaN: quiet NaN (NaNq) and signaling NaN (NaNs). Both explicitly do not represent numbers. In contrast to a quiet NaN, a signaling NaN triggers an exception (trap) if it occurs as an operand of an arithmetic operation.

IEEE 754 allows the user to deactivate these traps. In this case, signaling NaNs are treated like quiet NaNs.

Signaling NaNs can be used to fill uninitialized memory, so that every use of an uninitialized variable automatically raises an exception.

Quiet NaNs make it possible to handle calculations that cannot produce a result, for example because they are not defined for the given operands. Examples are the division of zero by zero or the logarithm of a negative number.

Quiet and signaling NaNs differ in the highest mantissa bit. For a quiet NaN it is 1, for a signaling NaN it is 0. The remaining mantissa bits can carry additional information, e.g. the cause of the NaN. This can be useful for exception handling. However, the standard does not stipulate what information the remaining mantissa bits contain. The evaluation of these bits is therefore platform-dependent.

The sign bit has no meaning for NaN. It is not specified which value the sign bit of a returned NaN has.
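The distinction by the highest mantissa bit can be illustrated on raw 32-bit patterns of the single format. The sketch works on integers only, since converting a signaling NaN into a native floating-point value may silently turn it into a quiet NaN on some platforms. All names (`is_nan`, `is_quiet_nan`, `is_signaling_nan`, `payload`) are illustrative and not taken from the standard.

```python
EXPONENT_MASK = 0x7F800000   # the 8 exponent bits of a single
MANTISSA_MASK = 0x007FFFFF   # the 23 mantissa bits
QUIET_BIT = 0x00400000       # highest mantissa bit

def is_nan(bits):
    """NaN: exponent all ones and mantissa non-zero; the sign bit is ignored."""
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0

def is_quiet_nan(bits):
    return is_nan(bits) and (bits & QUIET_BIT) != 0

def is_signaling_nan(bits):
    return is_nan(bits) and (bits & QUIET_BIT) == 0

def payload(bits):
    """Remaining mantissa bits; their meaning is platform-dependent."""
    return bits & (MANTISSA_MASK ^ QUIET_BIT)

for pattern in (0x7FC00000, 0x7F800001, 0x7F800000, 0xFFC00005):
    print(hex(pattern), is_quiet_nan(pattern), is_signaling_nan(pattern), payload(pattern))
```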

Roundings

IEEE 754 differentiates between binary rounding and binary-decimal rounding, for which lower quality requirements apply.

Binary rounding must round to the nearest representable number. If this is not unique (the value lies exactly halfway between two representable numbers), it is rounded so that the least significant bit of the mantissa becomes 0. Statistically, this rounds up in 50% of the cases and down in the other 50%, so that the statistical drift described by Knuth is avoided in longer calculations.
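Rounding halfway cases to an even mantissa can be observed just above $2^{53}$, where only every second integer is representable as a double. The sketch assumes CPython, whose integer-to-float conversion rounds to nearest with ties to even, and an IEEE 754 double as `float`; the names are illustrative.

```python
base = 2 ** 53

# Offset k from 2^53 before conversion, mapped to the offset after conversion.
# Odd offsets lie exactly halfway between two representable doubles.
rounded = {k: int(float(base + k)) - base for k in range(9)}

for k, result in rounded.items():
    print(k, "->", result)
```

Of the four halfway cases (offsets 1, 3, 5 and 7), two are rounded down and two are rounded up, which is the balance described above.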

An implementation that conforms to IEEE 754 must provide three further rounding modes that the programmer can select: rounding toward +infinity (always round up), rounding toward −infinity (always round down) and rounding toward 0 (always reduce the magnitude).

Operations

IEEE 754 compliant implementations must provide operations for arithmetic, calculation of the square root, conversions, and comparisons. Another group of operations is recommended in the appendix, but not compulsory.

Arithmetic and square root

IEEE 754 requires exactly rounded results from a (hardware or software) implementation for the operations addition, subtraction, multiplication and division of two operands as well as the operation square root of an operand. This means that the result determined must be the same as that which arises from an exact execution of the corresponding operation with subsequent rounding.
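The requirement "exact execution with subsequent rounding" can be checked for individual operands. The sketch below performs each operation exactly with rational numbers and rounds once at the end, then compares with the built-in result. It assumes that Python's `float` is an IEEE 754 double, that the platform computes in double precision without intermediate extended precision, and that converting a `Fraction` to `float` rounds correctly (as CPython's integer division does); the helper name `exact_then_round` and the operand list are illustrative.

```python
from fractions import Fraction
import operator

def exact_then_round(op, x, y):
    """Perform op exactly on rationals, then round once to the nearest double."""
    return float(op(Fraction(x), Fraction(y)))

OPERATIONS = [operator.add, operator.sub, operator.mul, operator.truediv]
OPERANDS = [(0.1, 0.2), (1.0, 3.0), (1e16, 1.0), (2.5, 1e-20)]

all_match = all(
    op(x, y) == exact_then_round(op, x, y)
    for op in OPERATIONS
    for x, y in OPERANDS
)
print(all_match)
```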

It is also required to compute the remainder after a division with an integer result. This remainder is defined by $r = x - y \cdot n$, where $n$ is an integer with $|n - \tfrac{x}{y}| < \tfrac{1}{2}$, or $n$ is even if $|n - \tfrac{x}{y}| = \tfrac{1}{2}$. This remainder must be determined exactly, without rounding.

Conversions

Conversions are required between all supported floating-point formats. When converting to a floating-point format with less precision, the result must be rounded exactly as described under arithmetic.

IEEE 754 compliant implementations must provide conversions between all supported floating point formats and all supported integer formats. The IEEE 754 does not define the integer formats in more detail.

For every supported floating point format, there must be an operation that converts this floating point number into the exactly rounded whole number in the same floating point format.

Finally, there must be conversions between the binary floating point format and a decimal format that meet precisely described minimum quality requirements.

Comparisons

Floating-point numbers according to IEEE 754 must be comparable. The standard defines the necessary comparison operations and the required results for all possible special cases (especially NaN, infinity and 0). In addition to the comparison results of school mathematics (less than, equal to, greater than), IEEE 754 has a further possible result, unordered, which occurs if one of the operands is NaN. Two NaNs always compare as unequal, even if their bit patterns match.
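The four possible comparison results can be sketched in a few lines. The example assumes Python's `float` comparisons follow IEEE 754, which they do on common platforms; the function names `unordered` and `compare` are illustrative.

```python
import math

def unordered(x, y):
    """True if the operands cannot be ordered, i.e. at least one is NaN."""
    return math.isnan(x) or math.isnan(y)

def compare(x, y):
    """Return one of the four mutually exclusive comparison results."""
    if unordered(x, y):
        return "unordered"
    if x < y:
        return "less"
    if x > y:
        return "greater"
    return "equal"

nan = float("nan")
print(compare(nan, nan), compare(0.0, -0.0), compare(-math.inf, math.inf))
print(nan == nan, nan != nan)
```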

Recommended operations

Ten additional operations are recommended in the appendix to the standard. Since they are needed in an implementation anyway, this recommendation ultimately amounts to making the operations available to the programmer. These operations are (in C notation): copysign(x, y), invertsign(x), scalb(y, n), logb(x), nextafter(x, y), finite(x), isnan(x), x ≠ y, unordered(x, y), class(x). Details of the implementation, especially for the special cases NaN etc., are also suggested.
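Several of these operations have close counterparts in Python's standard `math` module. The mapping below is an approximation chosen for illustration, not a statement about how the standard names them: `math.ldexp` stands in for scalb, the exponent returned by `math.frexp` minus one stands in for logb (for normalized, non-zero arguments only), and `math.nextafter` requires Python 3.9 or later.

```python
import math

copysign_result = math.copysign(24.0, -0.0)   # magnitude of x with the sign of y
scalb_result = math.ldexp(1.5, 4)             # y * 2^n
logb_result = math.frexp(24.0)[1] - 1         # unbiased exponent of 24 = 1.5 * 2^4
next_up = math.nextafter(1.0, 2.0)            # neighbour of 1.0 in the direction of 2.0
finite_result = math.isfinite(math.inf)
isnan_result = math.isnan(math.nan)

print(copysign_result, scalb_result, logb_result)
print(next_up - 1.0, finite_result, isnan_result)
```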

Exceptions, flags and traps

If exceptions occur during the calculation, status flags are set. The standard stipulates that the user can read and write these flags. The flags are "sticky": once they are set, they are retained until they are explicitly reset. For example, checking the flags is the only way to distinguish 1/0 (= infinity) from an overflow.

It is also recommended in the standard to enable trap handlers: If an exception occurs, the trap handler is called instead of setting the status flag. It is the responsibility of such trap handlers to set or delete the corresponding status flag.

Exceptions are divided into five classes in the standard: overflow, underflow, division by zero, invalid operation and inexact. A status flag is available for each class.
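Python does not expose the status flags of binary floating-point hardware, but its `decimal` module follows the same model of sticky flags and optional traps. The sketch below uses it purely as an analogy, with decimal rather than binary arithmetic; the small context limits (`prec=9`, `Emax=99`) are arbitrary assumptions chosen to provoke an overflow quickly.

```python
from decimal import Context, Decimal, DivisionByZero, Overflow

ctx = Context(prec=9, Emin=-99, Emax=99, traps=[])   # no traps: only flags are set

# 1/0 yields infinity and raises the division-by-zero flag.
div_result = ctx.divide(Decimal(1), Decimal(0))
div_flags = (bool(ctx.flags[DivisionByZero]), bool(ctx.flags[Overflow]))

# Flags are sticky: a later, harmless operation does not reset them.
ctx.add(Decimal(1), Decimal(1))
still_set = bool(ctx.flags[DivisionByZero])

# An overflow yields the same infinite result, but a different flag.
ctx.clear_flags()
ovf_result = ctx.multiply(Decimal("9E+99"), Decimal(10))
ovf_flags = (bool(ctx.flags[DivisionByZero]), bool(ctx.flags[Overflow]))

# With the trap enabled, the same operation raises an exception instead.
ctx.traps[DivisionByZero] = True
try:
    ctx.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True

print(div_result, div_flags, ovf_result, ovf_flags, trapped)
```

Both results are infinite, and only the flags tell the two causes apart.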

History

In the 1960s and early 1970s, each processor had its own format for floating-point numbers and its own FPU, or floating-point software, for processing that format. The same program could produce different results on different computers. The quality of the various floating-point arithmetics also varied widely.

Around 1976, Intel planned its own FPU for its microprocessors and wanted the best possible solution for the arithmetic to be implemented. In 1977, under the auspices of the IEEE, meetings began to standardize FPUs for floating-point arithmetic for microprocessors. The second meeting took place in November 1977 in San Francisco, chaired by Richard Delp. One of the leading participants was William Kahan.

Around 1980, the number of proposals for the standard was reduced to two: the KCS proposal (named after its authors Kahan, Coonen and Stone, 1977) ultimately prevailed over the alternative from DEC (F format, D format and G format). A major milestone on the road to the standard was the discussion of how to deal with underflow, which had been neglected by most programmers until then.

In parallel with the development of the standard, Intel largely implemented the proposals in the Intel 8087 FPU, which was used as a floating-point coprocessor for the 8088. The first version of the standard was adopted in 1985 and expanded in 2008.

Literature

• IEEE 754: reprinted in SIGPLAN Notices, Vol. 22, No. 2, Feb. 1987, pp. 9-25
• Jean-Michel Muller: Elementary Functions - Algorithms and Implementation. 2nd edition. Birkhäuser, Lyon 2006, ISBN 0-8176-4372-9.