IEEE 754-2008

from Wikipedia, the free encyclopedia

IEEE 754-2008 (previous working title: IEEE 754r) is a revision of the floating-point standard IEEE 754, which had been adopted in 1985. The old standard was very successful and was adopted by numerous processors and programming languages. Discussion of the revision began in 2001; the standard was approved in June 2008 and published in August 2008.

Main objectives

The main objectives of the adopted standard can be summarized as:

  • the merging of IEEE 754 and IEEE 854,
  • the reduction of implementation alternatives,
  • the removal of ambiguities in the previous IEEE 754,
  • an additional accumulating product, the fused multiply-add: FMA(A,B,C) = A·B + C,
  • half and quadruple precision in addition to single and double (16 and 128 bits alongside 32 and 64 bits),
  • the decimal formats considered necessary by the financial sector (IEEE 854),
  • further variable formats and interchange formats,
  • min and max with specifications for the special cases ±0 and ±∞, as well as
  • cosmetics: what was "denormalized" is now called "subnormal".

The standard is intended to define formats and methods for floating-point arithmetic as well as a minimum quality of implementation.
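The fused multiply-add listed among the objectives pays off because A·B + C is rounded only once. A minimal Python sketch (the helper `fused_mul_add` is an illustrative name, not an API from the standard) emulates that single rounding with exact rational arithmetic:

```python
from fractions import Fraction

def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulate FMA(a, b, c) = a*b + c with a single rounding:
    compute product and sum exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = float(2**27 + 1)   # exactly representable in binary64
b = float(2**27 - 1)
c = -float(2**54)

separate = a * b + c          # two roundings
fused = fused_mul_add(a, b, c)  # one rounding

print(separate)  # 0.0
print(fused)     # -1.0
```

With separately rounded operations the exact product 2^54 − 1 is first rounded up to 2^54, and the subsequent addition cancels to 0.0; the fused version delivers the correct −1.0.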

Formats

Formats include floating-point numbers with half (16-bit), single (32-bit), double (64-bit), and quadruple (128-bit) precision. The half format is a standardized minifloat. The basic formats are supplemented by extended and, new in this revision, extendable long-number formats. Interchange formats have also been added: in addition to the 16/32/64/128-bit representations, representations with any multiple of 32 bits are defined.

Densely packed decimal formats (DPD, 3 digits in 10 bits) have also been added. They differ from classic single-digit BCD formats as follows:

  • The usable bits are exploited efficiently, since 3 decimal digits (000…999, 1000 values used) are stored in 10 bits each (0…1023, 1024 possible values). One such group is called a declet. The waste is significantly smaller than with classic BCD numbers. The last column of the table gives the information content in bits, which is only slightly less than the storage space (with d = 7 mantissa digits and an exponent range of emin…emax, taking the sign bit into account).
  • Processing the decimal digits in groups of three matches the usual grouping convention (23 223 456; 24 W, 24 kW, 24 MW).
  • The number 0 also has the bit pattern "0000…0". However, 0 has a relatively large cohort.
  • The values 0 to 9 of a declet have 0s in their 6 leading bits.
  • The values 10 to 99 of a declet have 0s in their 3 leading bits.
  • Odd values in declets can be recognized from a single bit.
  • The 24 unused bit patterns ddx11x111x with dd = 01, 10 or 11 can easily be identified.
  • Numbers packed as declets can no longer be sorted by simple binary comparison, in contrast to classic BCD formats.
  • Instead of declets, the mantissa can also be stored as a binary integer in a bit field of the same size. The division of the combination field is then different.
  • A number's representation is not unique; several bit patterns can denote the same number. The set of bit patterns representing one number is called a cohort. A canonical representation is defined within each cohort.
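The efficiency claim in the first bullet is easy to check: three decimal digits carry log2(1000) ≈ 9.97 bits of information, so a 10-bit declet wastes far less space than the 12 bits of classic BCD. A short sketch:

```python
import math

info = math.log2(1000)    # information content of three decimal digits
declet_waste = 10 - info  # DPD: 3 digits in 10 bits
bcd_waste = 12 - info     # classic BCD: 3 digits in 3 x 4 bits

print(f"needed: {info:.3f} bits")
print(f"DPD waste: {declet_waste:.3f} bit per 3 digits")
print(f"BCD waste: {bcd_waste:.3f} bits per 3 digits")
```

DPD wastes about 0.03 bit per group of three digits, classic BCD about 2.03 bits.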

Signaling NaNs were proposed for deletion (February 3, 2003) but were later re-included in the proposal (February 21, 2003). A signaling NaN is a NaN with bit 7 set. Representations of ±∞ exist and are easily recognizable.

Type | Storage (bits) | Mantissa bits m | Effective bits p of a normalized number | Exponent bits e | emin | emax | Bias | Values of the cohort of a normalized number | Information content (bits)
---|---|---|---|---|---|---|---|---|---
b16 (half) | 16 | 10 | 11 | 5 | −14 | 15 | 15 | 1 ≤ E ≤ 30 | 16
b32 (single) | 32 | 23 | 24 | 8 | −126 | 127 | 127 | 1 ≤ E ≤ 254 | 32
b64 (double) | 64 | 52 | 53 | 11 | −1022 | 1023 | 1023 | 1 ≤ E ≤ 2046 | 64
b128 (quad) | 128 | 112 | 113 | 15 | −16382 | 16383 | 16383 | 1 ≤ E ≤ 32766 | 128
k = 32·j, j ≥ 4 | k | k − rnd(4·ld(k)) + 12 | k − rnd(4·ld(k)) + 13 | rnd(4·ld(k)) − 13 | 1 − emax | 2^(k−p−1) − 1 | emax | | k
d32 | 32 | 20 + 5 (a) | 7 digits | 6 | −95 | 96 | 101 | | 31.83
d64 | 64 | 50 + 5 | 16 digits | 8 | −383 | 384 | 398 | | 63.73
d128 | 128 | 110 + 5 | 34 digits | 12 | −6143 | 6144 | 6176 | | 127.53
k = 32·j, j ≥ 1 | k | 15·k/16 − 10 | 9·k/32 − 2 digits | k/16 + 4 | 1 − emax | 3·2^(k/16+3) | emax + p − 2 | |
(a) "20 + 5" in the mantissa column means:
  • 6 decimal digits are stored in the 20 bits (3 digits per 10 bits)
  • the 5 remaining bits (the combination field) store:
    • one more decimal digit
    • the most significant part of the exponent (one of three values)
    • the signaling of NaNs and infinities
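The generic binary row of the table (k = 32·j, j ≥ 4) can be evaluated directly; for k = 128 it reproduces the b128 row, and for k = 256 it yields the binary256 parameters. A sketch (the function name is chosen for illustration; rnd is round-to-nearest, ld is log2):

```python
import math

def binary_params(k: int):
    """Parameters of the binary interchange format of width k = 32*j, j >= 4."""
    e = round(4 * math.log2(k)) - 13      # exponent bits
    p = k - round(4 * math.log2(k)) + 13  # effective significand bits
    emax = 2 ** (k - p - 1) - 1
    emin = 1 - emax
    bias = emax
    return e, p, emin, emax, bias

print(binary_params(128))  # (15, 113, -16382, 16383, 16383) -- matches the b128 row
print(binary_params(256))  # (19, 237, -262142, 262143, 262143)
```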

Roundings

In addition to the four old IEEE 754 rounding modes there is one additional mode, so that the following roundings are required:

  • upward (toward +∞)
  • downward (toward −∞)
  • toward zero (truncation of the magnitude)
  • to nearest, ties to even (best possible; halfway cases go to the nearest even number)
  • to nearest, ties away from zero (best possible; halfway cases increase the magnitude; new in IEEE 754r, essentially the classic commercial rounding used in manual calculations)

The IEEE 754 default rounding (ties to even) was proposed as early as Carl Friedrich Gauß and avoids a statistical drift toward larger numbers in longer calculations.

In the discussion of the new standard this insight was apparently set aside again, and commercial rounding (ties away from zero) was reintroduced.
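The difference between the two ties-handling modes can be tried out with Python's decimal module, which implements the same decimal arithmetic model (ROUND_HALF_EVEN is the Gauß rounding, ROUND_HALF_UP the commercial one):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

for x in ("0.5", "1.5", "2.5", "3.5"):
    even = Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)
    up = Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    print(x, "->", even, up)
# 0.5 -> 0 1
# 1.5 -> 2 2
# 2.5 -> 2 3
# 3.5 -> 4 4
```

Over many halfway cases, ties-to-even rounds half of them down and half up, while ties-away always increases the magnitude, which is the statistical drift mentioned above.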

Exceptions

Exceptions and exception handling are specified.

New functions include predicate functions (greater than, equal to, etc.) and operators for maximum and minimum. The main point of discussion is their results for the special values (NaN, Inf).

Decimal coding

Storage space requirements

DPD | Size of the equivalent packed BCD | Gain
---|---|---
32 bit | 7 × 4 + 7.58 + 1 bit = 36.58 bit | +4.58 bit
64 bit | 16 × 4 + 9.58 + 1 bit = 74.58 bit | +10.58 bit
128 bit | 34 × 4 + 13.58 + 1 bit = 150.58 bit | +22.58 bit

The primary idea behind the densely packed decimal representation is that it can be recoded into a classic BCD mantissa and a binary exponent with extremely little (gate) effort, while at the same time using the storage space as efficiently as possible. The actual processing then takes place in classic BCD; recoding is only required when registers are read or written.
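The BCD-equivalent sizes in the table above can be recomputed: a packed-BCD representation needs 4 bits per digit, a sign bit, and log2 of the number of exponent values, which is 3 · 2^e for e remaining exponent bits. A short sketch:

```python
import math

# (digits, remaining exponent bits) for d32, d64, d128
formats = {32: (7, 6), 64: (16, 8), 128: (34, 12)}

for k, (digits, e_rem) in formats.items():
    exp_values = 3 * 2 ** e_rem  # number of exponent values of the DPD format
    bcd_bits = digits * 4 + math.log2(exp_values) + 1  # digits + exponent + sign
    print(f"d{k}: BCD equivalent {bcd_bits:.2f} bit, gain {bcd_bits - k:+.2f} bit")
```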

The coding of 32-bit, 64-bit and 128-bit decimal numbers follows the scheme below. For longer decimal encodings, each additional 32-bit word adds 2 bits to the exponent and 30 bits (3 × 10 bits) to the mantissa, so the exponent's range of values is quadrupled and the mantissa gains a further nine digits, while the 5-bit combination field is retained.

format | sign | combination field | remaining exponent | remaining mantissa
---|---|---|---|---
32 bit | 1 bit | 5 bits | 6 bits | 20 bits (2 declets)
64 bit | 1 bit | 5 bits | 8 bits | 50 bits (5 declets)
128 bit | 1 bit | 5 bits | 12 bits | 110 bits (11 declets)

Bit layout (s = sign, m = combination field, x = remaining exponent, b = remaining mantissa):

32 bit:  s mmmmm xxxxxx bbbbbbbbbb bbbbbbbbbb
64 bit:  s mmmmm xxxxxxxx bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb
128 bit: s mmmmm xxxxxxxxxxxx bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb bbbbbbbbbb

  • sign bit s: 0 = positive, 1 = negative
  • combination field: codes the MSBs of exponent and mantissa according to Table 1
  • remaining exponent: binary coded
  • remaining mantissa: each declet is coded according to Table 2 and contributes three further digits

The number consists of

  • a sign, stored in the sign bit s;
  • an exponent whose range of values emin … emax is mapped to the values 0 … 3·2^e − 1 = (0…2)·2^e + (0 … 2^e − 1). The upper part (three states) is stored in the combination field, the remaining e bits in binary form in the remaining exponent;
  • a mantissa consisting of p = 3·n + 1 digits. The most significant digit is stored in the combination field; the remaining 3·n digits are stored in groups of three in the remaining mantissa.

The following coding tables are required for encoding and decoding:

Table 1: Coding rules for the combination field: MSBs of the exponent and the mantissa

m4 m3 m2 m1 m0 | Exp. MSBs | Mant. MSD | Description
---|---|---|---
0 0 a b c | 00 | 0abc | digit 0–7
0 1 a b c | 01 | 0abc |
1 0 a b c | 10 | 0abc |
1 1 0 0 c | 00 | 100c | digit 8–9
1 1 0 1 c | 01 | 100c |
1 1 1 0 c | 10 | 100c |
1 1 1 1 0 | | | ±Infinity
1 1 1 1 1 | | | NaN

Comment: The sign bit of NaNs is ignored. The MSB of the remaining exponent determines whether the NaN is quiet or signaling.
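Table 1 translates directly into code. A sketch of a combination-field decoder (the function name and return convention are chosen here for illustration):

```python
def decode_combination(m: int):
    """Decode the 5-bit combination field per Table 1.
    Returns (exponent MSBs, most significant digit) or a special class."""
    m4m3 = m >> 3
    if m4m3 == 0b11:
        m2m1 = (m >> 1) & 0b11
        if m2m1 == 0b11:                 # 1 1 1 1 x: special values
            return "NaN" if (m & 1) else "Infinity"
        return (m2m1, 8 + (m & 1))       # digit 8-9
    return (m4m3, m & 0b111)             # digit 0-7

print(decode_combination(0b00101))  # (0, 5)
print(decode_combination(0b11001))  # (0, 9)
print(decode_combination(0b11110))  # Infinity
```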
Table 2: Coding rules for the declets of the densely packed decimal digits of the remaining mantissa

b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 | d2 d1 d0 | Coded value | Description
---|---|---|---
a b c d e f 0 g h i | 0abc 0def 0ghi | (0–7)(0–7)(0–7) | three digits 0–7
a b c d e f 1 0 0 i | 0abc 0def 100i | (0–7)(0–7)(8–9) | two digits 0–7, one digit 8–9
a b c g h f 1 0 1 i | 0abc 100f 0ghi | (0–7)(8–9)(0–7) |
g h c d e f 1 1 0 i | 100c 0def 0ghi | (8–9)(0–7)(0–7) |
g h c 0 0 f 1 1 1 i | 100c 100f 0ghi | (8–9)(8–9)(0–7) | one digit 0–7, two digits 8–9
d e c 0 1 f 1 1 1 i | 100c 0def 100i | (8–9)(0–7)(8–9) |
a b c 1 0 f 1 1 1 i | 0abc 100f 100i | (0–7)(8–9)(8–9) |
x x c 1 1 f 1 1 1 i | 100c 100f 100i | (8–9)(8–9)(8–9) | three digits 8–9 (x: don't care)
Note
In contrast to the binary representation, where the significand is forced into normalized form and its MSB is omitted, no normalization is enforced here and the digit 0 is available as the most significant digit; numbers are therefore not uniquely coded.
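Table 2 likewise translates directly into a declet decoder. The sketch below (b[9] is the declet's MSB; the function name is chosen for illustration) also confirms the remarks above: all 1024 bit patterns decode to valid digits, but only 1000 distinct triples occur, because 24 patterns are redundant:

```python
def decode_declet(d: int):
    """Decode a 10-bit declet into three decimal digits per Table 2."""
    b = [(d >> i) & 1 for i in range(10)]      # b[0] = LSB ... b[9] = MSB
    small = lambda x, y, z: 4 * x + 2 * y + z  # digit 0-7 from three free bits
    if b[3] == 0:                              # three digits 0-7
        return (small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0]))
    if (b[2], b[1]) == (0, 0):
        return (small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), 8 + b[0])
    if (b[2], b[1]) == (0, 1):
        return (small(b[9], b[8], b[7]), 8 + b[4], small(b[6], b[5], b[0]))
    if (b[2], b[1]) == (1, 0):
        return (8 + b[7], small(b[6], b[5], b[4]), small(b[9], b[8], b[0]))
    # b[2] b[1] == 1 1: two or three digits 8-9, selected by b[6] b[5]
    if (b[6], b[5]) == (0, 0):
        return (8 + b[7], 8 + b[4], small(b[9], b[8], b[0]))
    if (b[6], b[5]) == (0, 1):
        return (8 + b[7], small(b[9], b[8], b[4]), 8 + b[0])
    if (b[6], b[5]) == (1, 0):
        return (small(b[9], b[8], b[7]), 8 + b[4], 8 + b[0])
    return (8 + b[7], 8 + b[4], 8 + b[0])      # three digits 8-9

print(decode_declet(0b0010100011))  # (1, 2, 3)
print(decode_declet(0b1111111111))  # (9, 9, 9)
print(len({decode_declet(i) for i in range(1024)}))  # 1000
```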

Decimal floating point numbers in practice

The problems with decimal floating point numbers include:

  • Most numbers can be represented exactly neither in binary nor in decimal format. After a few calculation steps most results are inexact; a currency conversion or the deduction of sales tax is enough.
  • Most of the problems listed have simpler and more powerful solutions. For financial tasks, .NET, for example, provides the System.Decimal data type, which can exactly represent integers with magnitudes up to 79,228,162,514,264,337,593,543,950,335.
  • For hardware (additional logic) and software (conversion errors) they represent a further source of errors.

The results are:

  • Decimal floating-point numbers are standardized, but even after 15 years they are hardly available in hardware. They can be implemented in software, in FPGAs and in ASICs, but even on this the publications are few and mostly limited to addition and subtraction.
  • The decimal formats are mainly demanded by the financial industry, but on closer inspection they are not needed there. Fixed-point representations based on the smallest accounting unit and 64-bit integers exactly cover a range of values 922 times as large as Decimal64 (−92,233,720,368,547,758.08 … +92,233,720,368,547,758.07 compared with −99,999,999,999,999.99 … +99,999,999,999,999.99). They cannot, however, represent even larger values with reduced accuracy, nor smaller amounts more precisely.
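The factor of 922 quoted above can be recomputed with integer arithmetic, counting amounts in cents:

```python
int64_cents = 2**63 - 1   # largest amount in cents that fits a signed 64-bit integer
dec64_cents = 10**16 - 1  # largest 16-digit Decimal64 amount, two digits of which are cents

print(int64_cents // dec64_cents)  # 922
```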

They are useful:

  • without restriction as interchange formats whenever the exact representation of decimal values is required.

Two opposing viewpoints collide here.

  • On the one hand, the memory, computing-time and cost advantages of a binary format are emphasized, as well as its more even distribution of numbers.
  • On the other hand, it is argued that exact results (mostly matching manual calculation) are only possible with decimal arithmetic, and that in times of fast processors and cheap memory the disadvantages no longer carry much weight.

William Kahan has claimed that binary arithmetic will hardly play a role in the future:

"Why is decimal floating-point hardware a good idea anyway? Because it can help our industry avoid errors designed not to be found."

— William Kahan: Floating-Point Arithmetic Besieged by "Business Decisions"

But he overlooks that:

  • packed decimal formats require additional chip area, are less efficient, and are slower;
  • computing power of any magnitude opens up new fields of application, and yet there will always be tasks for which it is insufficient;
  • there will never be so much computing power that one would voluntarily forgo some of it;
  • the more complex a calculation, the less anyone cares whether its values can be represented exactly in decimal. Only a select few numbers have the honor of being keyed in or read by a human in the decimal system.

References

  1. IEEE 754-2008: Standard for Floating-Point Arithmetic, IEEE Standards Association, 2008, doi:10.1109/IEEESTD.2008.4610935
  2. Michael F. Cowlishaw: A Summary of Densely Packed Decimal encoding. IBM, February 13, 2007. Archived from the original on September 24, 2015; retrieved February 7, 2016.
  3. William Kahan: Floating-Point Arithmetic Besieged by "Business Decisions". (PDF, 174 kB) IEEE-sponsored ARITH 17 Symposium on Computer Arithmetic, July 5, 2005, p. 6 of 28; retrieved February 19, 2020.