IEEE 754-2008

from Wikipedia, the free encyclopedia

IEEE 754-2008 (previous working title: IEEE 754r) is a revision of the floating-point standard IEEE 754, which had been adopted in 1985. The old standard was very successful and was adopted by numerous processors and programming languages. Discussion of the revision began in 2001; the standard was approved in June 2008 and published in August 2008.

Main objectives

The main objectives of the adopted standard can be summarized as:

  • the merging of IEEE 754 and IEEE 854,
  • the reduction of implementation alternatives,
  • the removal of ambiguities in the previous IEEE 754,
  • an additional accumulating product, the fused multiply-add: FMA(A,B,C) = A·B + C,
  • half and quadruple precision in addition to single and double (16 and 128 bits alongside 32 and 64 bits),
  • the decimal formats considered necessary by the financial sector (IEEE 854),
  • further variable formats and interchange formats,
  • min and max with specifications for the special cases ±0 and ±∞, as well as
  • cosmetics: what was "denormalized" is now called "subnormal".

The standard is intended to define formats and methods for floating-point arithmetic as well as a minimum quality of implementation.
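The fused multiply-add listed among the objectives pays off because A·B + C is rounded only once. A minimal Python sketch (the helper `fused_mul_add` is an illustrative name, not an API from the standard) emulates that single rounding with exact rational arithmetic:

```python
from fractions import Fraction

def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulate FMA(a, b, c) = a*b + c with a single rounding:
    compute product and sum exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = float(2**27 + 1)   # exactly representable in binary64
b = float(2**27 - 1)
c = -float(2**54)

separate = a * b + c          # two roundings
fused = fused_mul_add(a, b, c)  # one rounding

print(separate)  # 0.0
print(fused)     # -1.0
```

With separately rounded operations the exact product 2^54 − 1 is first rounded up to 2^54, and the subsequent addition cancels to 0.0; the fused version delivers the correct −1.0.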

Formats

Formats include floating-point numbers with half (16-bit), single (32-bit), double (64-bit), and quadruple (128-bit) precision. The half format is a standardized minifloat. The basic formats are supplemented by extended and, new in this revision, extendable long-number formats. Interchange formats have also been added: in addition to the 16/32/64/128-bit representations, representations with any multiple of 32 bits are defined.

Densely packed decimal formats (DPD, 3 digits in 10 bits) have also been added. They differ from classic single-digit BCD formats as follows:

  • The usable bits are exploited efficiently, since 3 decimal digits (000…999, 1000 values used) are stored in 10 bits each (0…1023, 1024 possible values). One such group is called a declet. The waste is significantly smaller than with classic BCD numbers. The last column of the table gives the information content in bits, which is only slightly less than the storage space (with d = 7 mantissa digits and an exponent range of emin…emax, taking the sign bit into account).
  • Processing the decimal digits in groups of three matches the usual grouping convention (23 223 456; 24 W, 24 kW, 24 MW).
  • The number 0 also has the bit pattern "0000…0". However, 0 has a relatively large cohort.
  • The values 0 to 9 of a declet have 0s in their 6 leading bits.
  • The values 10 to 99 of a declet have 0s in their 3 leading bits.
  • Odd values in declets can be recognized from a single bit.
  • The 24 unused bit patterns ddx11x111x with dd = 01, 10 or 11 can easily be identified.
  • Numbers packed as declets can no longer be sorted by simple binary comparison, in contrast to classic BCD formats.
  • Instead of declets, the mantissa can also be stored as a binary integer in a bit field of the same size. The division of the combination field is then different.
  • A number's representation is not unique; several bit patterns can denote the same number. The set of bit patterns representing one number is called a cohort. A canonical representation is defined within each cohort.
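The efficiency claim in the first bullet is easy to check: three decimal digits carry log2(1000) ≈ 9.97 bits of information, so a 10-bit declet wastes far less space than the 12 bits of classic BCD. A short sketch:

```python
import math

info = math.log2(1000)    # information content of three decimal digits
declet_waste = 10 - info  # DPD: 3 digits in 10 bits
bcd_waste = 12 - info     # classic BCD: 3 digits in 3 x 4 bits

print(f"needed: {info:.3f} bits")
print(f"DPD waste: {declet_waste:.3f} bit per 3 digits")
print(f"BCD waste: {bcd_waste:.3f} bits per 3 digits")
```

DPD wastes about 0.03 bit per group of three digits, classic BCD about 2.03 bits.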

Signaling NaNs were proposed for deletion (February 3, 2003) but were later re-included in the proposal (February 21, 2003). A signaling NaN is a NaN with bit 7 set. Representations of ±∞ exist and are easily recognizable.

Type | Storage (bits) | Mantissa bits m | Effective bits p of a normalized number | Exponent bits e | emin | emax | Bias | Values of the cohort of a normalized number | Information content (bits)
---|---|---|---|---|---|---|---|---|---
b16 (half) | 16 | 10 | 11 | 5 | −14 | 15 | 15 | 1 ≤ E ≤ 30 | 16
b32 (single) | 32 | 23 | 24 | 8 | −126 | 127 | 127 | 1 ≤ E ≤ 254 | 32
b64 (double) | 64 | 52 | 53 | 11 | −1022 | 1023 | 1023 | 1 ≤ E ≤ 2046 | 64
b128 (quad) | 128 | 112 | 113 | 15 | −16382 | 16383 | 16383 | 1 ≤ E ≤ 32766 | 128
k = 32·j, j ≥ 4 | k | k − rnd(4·ld(k)) + 12 | k − rnd(4·ld(k)) + 13 | rnd(4·ld(k)) − 13 | 1 − emax | 2^(k−p−1) − 1 | emax | | k
d32 | 32 | 20 + 5 (a) | 7 digits | 6 | −95 | 96 | 101 | | 31.83
d64 | 64 | 50 + 5 | 16 digits | 8 | −383 | 384 | 398 | | 63.73
d128 | 128 | 110 + 5 | 34 digits | 12 | −6143 | 6144 | 6176 | | 127.53
k = 32·j, j ≥ 1 | k | 15·k/16 − 10 | 9·k/32 − 2 digits | k/16 + 4 | 1 − emax | 3·2^(k/16+3) | emax + p − 2 | |
(a) "20 + 5" in the mantissa column means:
  • 6 decimal digits are stored in the 20 bits (3 digits per 10 bits)
  • the 5 remaining bits (the combination field) store:
    • one more decimal digit
    • the most significant part of the exponent (one of three values)
    • the signaling of NaNs and infinities
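The generic binary row of the table (k = 32·j, j ≥ 4) can be evaluated directly; for k = 128 it reproduces the b128 row, and for k = 256 it yields the binary256 parameters. A sketch (the function name is chosen for illustration; rnd is round-to-nearest, ld is log2):

```python
import math

def binary_params(k: int):
    """Parameters of the binary interchange format of width k = 32*j, j >= 4."""
    e = round(4 * math.log2(k)) - 13      # exponent bits
    p = k - round(4 * math.log2(k)) + 13  # effective significand bits
    emax = 2 ** (k - p - 1) - 1
    emin = 1 - emax
    bias = emax
    return e, p, emin, emax, bias

print(binary_params(128))  # (15, 113, -16382, 16383, 16383) -- matches the b128 row
print(binary_params(256))  # (19, 237, -262142, 262143, 262143)
```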

Roundings

In addition to the four old IEEE 754 rounding modes there is one additional mode, so that the following roundings are required:

  • upward (toward +∞)
  • downward (toward −∞)
  • toward zero (truncation of the magnitude)
  • to nearest, ties to even (best possible; halfway cases go to the nearest even number)
  • to nearest, ties away from zero (best possible; halfway cases increase the magnitude; new in IEEE 754r, essentially the classic commercial rounding used in manual calculations)

The IEEE 754 default rounding (ties to even) was proposed as early as Carl Friedrich Gauß and avoids a statistical drift toward larger numbers in longer calculations.

In the discussion of the new standard this insight was apparently set aside again, and commercial rounding (ties away from zero) was reintroduced.
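The difference between the two ties-handling modes can be tried out with Python's decimal module, which implements the same decimal arithmetic model (ROUND_HALF_EVEN is the Gauß rounding, ROUND_HALF_UP the commercial one):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

for x in ("0.5", "1.5", "2.5", "3.5"):
    even = Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    up = Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    print(x, "->", even, up)
# 0.5 -> 0 1
# 1.5 -> 2 2
# 2.5 -> 2 3
# 3.5 -> 4 4
```

Over many halfway cases, ties-to-even rounds half of them down and half up, while ties-away always increases the magnitude, which is the statistical drift mentioned above.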

Exceptions

Exceptions and exception handling are specified.

New functions include predicate functions (greater than, equal to, etc.) and operators for maximum and minimum. The main point of discussion is their results for the special values (NaN, Inf).

Decimal coding

Storage space requirements

DPD | Size of the equivalent packed BCD | Gain
---|---|---
32 bit | 7 × 4 + 7.58 + 1 bit = 36.58 bit | +4.58 bit
64 bit | 16 × 4 + 9.58 + 1 bit = 74.58 bit | +10.58 bit
128 bit | 34 × 4 + 13.58 + 1 bit = 150.58 bit | +22.58 bit

The primary idea behind the densely packed decimal representation is that it can be recoded into a classic BCD mantissa and a binary exponent with extremely little (gate) effort, while at the same time using the storage space as efficiently as possible. The actual processing then takes place in classic BCD; recoding is only required when registers are read or written.
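The BCD-equivalent sizes in the table above can be recomputed: a packed-BCD representation needs 4 bits per digit, a sign bit, and log2 of the number of exponent values, which is 3 · 2^e for e remaining exponent bits. A short sketch:

```python
import math

# (digits, remaining exponent bits) for d32, d64, d128
formats = {32: (7, 6), 64: (16, 8), 128: (34, 12)}

for k, (digits, e_rem) in formats.items():
    exp_values = 3 * 2 ** e_rem  # number of exponent values of the DPD format
    bcd_bits = digits * 4 + math.log2(exp_values) + 1  # digits + exponent + sign
    print(f"d{k}: BCD equivalent {bcd_bits:.2f} bit, gain {bcd_bits - k:+.2f} bit")
```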

The coding of 32-bit, 64-bit and 128-bit decimal numbers follows the scheme below. For longer decimal encodings, each additional 32-bit word adds 2 bits to the exponent and 30 bits (3 × 10 bits) to the mantissa, so the exponent's range of values is quadrupled and the mantissa gains a further nine digits, while the 5-bit combination field is retained.

format | sign | combination field | remaining exponent | remaining mantissa
---|---|---|---|---
32 bit | 1 bit | 5 bits | 6 bits | 20 bits (2 declets)
64 bit | 1 bit | 5 bits | 8 bits | 50 bits (5 declets)
128 bit | 1 bit | 5 bits | 12 bits | 110 bits (11 declets)

Bit layout (s = sign, m = combination field, x = remaining exponent, b = remaining mantissa):

32 bit:  s mmmmm xxxxxx bbbbbbbbbb bbbbbbbbbb
64 bit:  s mmmmm xxxxxxxx bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb
128 bit: s mmmmm xxxxxxxxxxxx bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb

  • sign bit s: 0 = positive, 1 = negative
  • combination field: codes the MSBs of exponent and mantissa according to Table 1
  • remaining exponent: binary coded
  • remaining mantissa: each declet is coded according to Table 2 and contributes three further digits

The number consists of

  • a sign, stored in the sign bit s;
  • an exponent whose range of values emin … emax is mapped to the values 0 … 3·2^e − 1 = (0…2)·2^e + (0 … 2^e − 1). The upper part (three states) is stored in the combination field, the remaining e bits in binary form in the remaining exponent;
  • a mantissa consisting of p = 3·n + 1 digits. The most significant digit is stored in the combination field; the remaining 3·n digits are stored in groups of three in the remaining mantissa.

The following coding tables are required for encoding and decoding:

Table 1: Coding rules for the combination field: MSBs of the exponent and the mantissa

m4 m3 m2 m1 m0 | Exp. MSBs | Mant. MSD | Description
---|---|---|---
0 0 a b c | 00 | 0abc | digit 0–7
0 1 a b c | 01 | 0abc |
1 0 a b c | 10 | 0abc |
1 1 0 0 c | 00 | 100c | digit 8–9
1 1 0 1 c | 01 | 100c |
1 1 1 0 c | 10 | 100c |
1 1 1 1 0 | | | ±Infinity
1 1 1 1 1 | | | NaN

Comment: The sign bit of NaNs is ignored. The MSB of the remaining exponent determines whether the NaN is quiet or signaling.
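Table 1 translates directly into code. A sketch of a combination-field decoder (the function name and return convention are chosen here for illustration):

```python
def decode_combination(m: int):
    """Decode the 5-bit combination field per Table 1.
    Returns (exponent MSBs, most significant digit) or a special class."""
    m4m3 = m >> 3
    if m4m3 == 0b11:
        m2m1 = (m >> 1) & 0b11
        if m2m1 == 0b11:                 # 1 1 1 1 x: special values
            return "NaN" if (m & 1) else "Infinity"
        return (m2m1, 8 + (m & 1))       # digit 8-9
    return (m4m3, m & 0b111)             # digit 0-7

print(decode_combination(0b00101))  # (0, 5)
print(decode_combination(0b11001))  # (0, 9)
print(decode_combination(0b11110))  # Infinity
```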
Table 2: Coding rules for the declets of the densely packed decimal digits of the remaining mantissa

b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 | d2 d1 d0 | Coded value | Description
---|---|---|---
a b c d e f 0 g h i | 0abc 0def 0ghi | (0–7)(0–7)(0–7) | three digits 0–7
a b c d e f 1 0 0 i | 0abc 0def 100i | (0–7)(0–7)(8–9) | two digits 0–7, one digit 8–9
a b c g h f 1 0 1 i | 0abc 100f 0ghi | (0–7)(8–9)(0–7) |
g h c d e f 1 1 0 i | 100c 0def 0ghi | (8–9)(0–7)(0–7) |
g h c 0 0 f 1 1 1 i | 100c 100f 0ghi | (8–9)(8–9)(0–7) | one digit 0–7, two digits 8–9
d e c 0 1 f 1 1 1 i | 100c 0def 100i | (8–9)(0–7)(8–9) |
a b c 1 0 f 1 1 1 i | 0abc 100f 100i | (0–7)(8–9)(8–9) |
x x c 1 1 f 1 1 1 i | 100c 100f 100i | (8–9)(8–9)(8–9) | three digits 8–9 (x: don't care)
Note
In contrast to the binary representation, where the significand is forced into normalized form and its MSB is omitted, no normalization is enforced here and the digit 0 is available as the most significant digit; numbers are therefore not uniquely coded.
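Table 2 likewise translates directly into a declet decoder. The sketch below (b[9] is the declet's MSB; the function name is chosen for illustration) also confirms the remarks above: all 1024 bit patterns decode to valid digits, but only 1000 distinct triples occur, because 24 patterns are redundant:

```python
def decode_declet(d: int):
    """Decode a 10-bit declet into three decimal digits per Table 2."""
    b = [(d >> i) & 1 for i in range(10)]      # b[0] = LSB ... b[9] = MSB
    small = lambda x, y, z: 4 * x + 2 * y + z  # digit 0-7 from three free bits
    if b[3] == 0:                              # three digits 0-7
        return (small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0]))
    if (b[2], b[1]) == (0, 0):
        return (small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), 8 + b[0])
    if (b[2], b[1]) == (0, 1):
        return (small(b[9], b[8], b[7]), 8 + b[4], small(b[6], b[5], b[0]))
    if (b[2], b[1]) == (1, 0):
        return (8 + b[7], small(b[6], b[5], b[4]), small(b[9], b[8], b[0]))
    # b[2] b[1] == 1 1: two or three digits 8-9, selected by b[6] b[5]
    if (b[6], b[5]) == (0, 0):
        return (8 + b[7], 8 + b[4], small(b[9], b[8], b[0]))
    if (b[6], b[5]) == (0, 1):
        return (8 + b[7], small(b[9], b[8], b[4]), 8 + b[0])
    if (b[6], b[5]) == (1, 0):
        return (small(b[9], b[8], b[7]), 8 + b[4], 8 + b[0])
    return (8 + b[7], 8 + b[4], 8 + b[0])      # three digits 8-9

print(decode_declet(0b0010100011))  # (1, 2, 3)
print(decode_declet(0b1111111111))  # (9, 9, 9)
print(len({decode_declet(i) for i in range(1024)}))  # 1000
```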

Decimal floating point numbers in practice

The problems with decimal floating point numbers include:

  • Most numbers can be represented exactly neither in binary nor in decimal format. After a few calculation steps most results are inexact; a currency conversion or the deduction of sales tax is enough.
  • Most of the problems listed have simpler and more powerful solutions. For financial tasks, .NET, for example, provides the System.Decimal data type, which can exactly represent integers with magnitudes up to 79,228,162,514,264,337,593,543,950,335.
  • For hardware (additional logic) and software (conversion errors) they represent a further source of errors.

The results are:

  • Decimal floating-point numbers are standardized, but even after 15 years they are hardly available in hardware. They can be implemented in software, in FPGAs and in ASICs, but even on this the publications are few and mostly limited to addition and subtraction.
  • The decimal formats are mainly demanded by the financial industry, but on closer inspection they are not needed there. Fixed-point representations based on the smallest accounting unit and 64-bit integers exactly cover a range of values 922 times as large as Decimal64 (−92,233,720,368,547,758.08 … +92,233,720,368,547,758.07 compared with −99,999,999,999,999.99 … +99,999,999,999,999.99). They cannot, however, represent even larger values with reduced accuracy, nor smaller amounts more precisely.
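The factor of 922 quoted above can be recomputed with integer arithmetic, counting amounts in cents:

```python
int64_cents = 2**63 - 1   # largest amount in cents that fits a signed 64-bit integer
dec64_cents = 10**16 - 1  # largest 16-digit Decimal64 amount, two digits of which are cents

print(int64_cents // dec64_cents)  # 922
```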

They are useful:

  • without restriction as interchange formats whenever the exact representation of decimal values is required.

Two opposing viewpoints collide here.

  • On the one hand, the memory, computing-time and cost advantages of a binary format are emphasized, as well as its more even distribution of numbers.
  • On the other hand, it is argued that exact results (mostly matching manual calculation) are only possible with decimal arithmetic, and that in times of fast processors and cheap memory the disadvantages no longer carry much weight.

William Kahan has claimed that binary arithmetic will hardly play a role in the future:

"Why is decimal floating-point hardware a good idea anyway? Because it can help our industry avoid errors designed not to be found."

— William Kahan: Floating-Point Arithmetic Besieged by "Business Decisions"

But he overlooks that:

  • packed decimal formats require additional chip area, are less efficient, and are slower;
  • computing power of any magnitude opens up new fields of application, and yet there will always be tasks for which it is insufficient;
  • there will never be so much computing power that one would voluntarily forgo some of it;
  • the more complex a calculation, the less anyone cares whether its values can be represented exactly in decimal. Only a select few numbers have the honor of being keyed in or read by a human in the decimal system.

References

  1. IEEE 754-2008: Standard for Floating-Point Arithmetic, IEEE Standards Association, 2008, doi:10.1109/IEEESTD.2008.4610935
  2. Michael F. Cowlishaw: A Summary of Densely Packed Decimal encoding. IBM, February 13, 2007. Archived from the original on September 24, 2015; retrieved February 7, 2016.
  3. William Kahan: Floating-Point Arithmetic Besieged by "Business Decisions". (PDF, 174 kB) IEEE-sponsored ARITH 17 Symposium on Computer Arithmetic, July 5, 2005, p. 6 of 28; retrieved February 19, 2020.