# Floating point number

A floating point number (often shortened to float; literally a number with a floating point) is an approximate representation of a real number.

Exactly representable floating point numbers for different mantissa lengths, base: 2, exponent −3 to 1

The set of floating point numbers is a subset of the rational numbers. Together with the operations defined on them (floating point arithmetic), the floating point numbers form a finite arithmetic that was developed primarily with regard to numerical calculations on (binary) computers.

## Basics

### Problem

All (mechanical or electronic) calculation aids, from the abacus to the computer, use fixed-point numbers as the simplest form of number representation. A sequence of digits, usually of limited length, is stored and the radix point is assumed to be at a fixed position.

In larger calculations, overflows inevitably occur. These make it necessary to scale the values and recalculate them so that the final result and all intermediate results fit into the permitted value range. This scaling is time consuming and needs to be automated.

### Exponential notation

An obvious idea that leads directly to floating point numbers is to store the exact position of the radix point along with each value. Mathematically, this means nothing more than representing the number $x$ by two values, the mantissa $m$ and the exponent $e$: $x = m \cdot 10^{e}$. The freedom in choosing the exponent can be used to bring the mantissa into a fixed range of values, for example $1 \leq m < 10$. This step is called normalizing the mantissa.

Example: The value of the speed of light is

$\begin{aligned} c &= 299\,792\,458\ \text{m/s} \\ &= 299\,792.458 \cdot 10^{3}\ \text{m/s} \\ &= 0.299\,792\,458 \cdot 10^{9}\ \text{m/s} \\ &= 2.997\,924\,58 \cdot 10^{8}\ \text{m/s} \end{aligned}$

Only the mantissa of the last representation is normalized.
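The normalization step can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `normalize` is hypothetical, and the standard `decimal` module is used so that the decimal digits stay exact.

```python
from decimal import Decimal

def normalize(value: Decimal):
    """Return (m, e) with value == m * 10**e and 1 <= |m| < 10."""
    if value == 0:
        raise ValueError("zero has no normalized representation")
    e = value.adjusted()      # exponent of the leading digit
    m = value.scaleb(-e)      # move the radix point next to the leading digit
    return m, e

m, e = normalize(Decimal("299792458"))
print(m, e)  # mantissa and exponent of the speed of light in m/s
```

The same helper applied to `Decimal("0.000592653")` yields a negative exponent, since the leading digit sits to the right of the radix point.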

This notation has long been used by physicists and mathematicians to write very large and very small numbers. Even today, the floating point notation on pocket calculators is therefore specifically referred to as the scientific format (sci).

A compact variant of the exponential notation is often used by programmers for inputting and outputting numbers in text form, e.g. in source code or in the debugger: 2.99792458e8 (= 2.997 924 58 · 10^8) or 3.141592653589793d0 (d is used for numbers with double precision). The e or d is to be understood as a short form of "times 10 to the power of".

### Floating point arithmetic

For calculations with floating point numbers, each number and each intermediate result is scaled individually (as opposed to global scaling). Scaling each intermediate result (calculating its exponent) requires additional computational effort and was therefore avoided as far as possible until well into the 1980s. The PCs of that time did not have a floating point processor as standard. Another factor was the higher memory requirement of floating point numbers, which could only be limited by forgoing higher precision. Accordingly, at first only supercomputers (number crunchers) had floating point hardware, or at least hardware support for software floating point arithmetic.

The choice of base 10 is arbitrary and owed only to the customary decimal system. Floating point numbers can be represented with any base; in general, $x = m \cdot b^{e}$ holds for an arbitrarily chosen base $b$. Computers use $b = 2$ (predominantly), $b = 16$ (rarely today) or $b = 10$ (e.g. for financial mathematics, see below). For any base, the condition for normalized numbers is $1 \leq m < b$.

### Historical development

The first documented use of floating point dates back about 2700 years: in Mesopotamia, scientific calculations were carried out using the base $b = 60$, and the exponent (usually a small integer) was simply kept in one's head. Until recently, the same procedure was common for calculations with a slide rule.

In calculating machines, Konrad Zuse was the first to use a floating point representation of his own, in his computers Z1 and Z3.

## Representation

The previous section already introduced the basic parameters of a floating point number. These are the base $b$, the number of mantissa digits $p$ and the number of exponent digits $r$. In addition, there are other parameters that are intended to simplify the arithmetic operations during calculations. This section briefly describes the parameters and bit fields of a general floating point number.

### Base

$2.997\,924\,58 \cdot {\color{red}10}^{8}$

One parameter is the chosen base $b$. Numbers that are processed directly by humans use either $b = 10$ or $b = 1000$. In the latter case, the prefixes of the International System of Units are used for the exponent: kilo = 1000^1, mega = 1000^2, giga = 1000^3, tera = 1000^4 and milli = 1000^−1, micro = 1000^−2, nano = 1000^−3, pico = 1000^−4.

In computers, the binary system and its relatives have prevailed, and the bases $b = 2$, $b = 8$ and $b = 16$ are common. Since the IEEE 754 standard for floating point numbers, the base $b = 2$ has been used almost exclusively in modern computers.

### Mantissa

${\color{red}2}.{\color{red}997}\,{\color{red}924}\,{\color{red}58} \cdot 10^{8}$

The mantissa $m$ contains the digits of the floating point number. Storing more digits increases the accuracy. The number $p$ of mantissa digits expresses how precisely the number is approximated. This floating point precision is either stated directly or described by the smallest number $\epsilon$ that can be added to one to give a result different from one ($1 + \epsilon > 1$; $\epsilon$ minimal; see the section on properties below).

Example: For IEEE 754 numbers of the type single with the base $b = 2$, the stored mantissa is $p = 23$ digits long. Here $\epsilon = 2^{-23} \approx 1.19209289551 \cdot 10^{-7}$.

### Exponent

$2.997\,924\,58 \cdot 10^{\color{red}8}$

After normalization, the exponent $e$ stores the exact position of the radix point and thus the order of magnitude of the number. The number $r$ of exponent digits limits the possible positions of the radix point and thus determines the value range of the representable floating point numbers. To describe a system of floating point numbers, one specifies the smallest and the largest possible exponent, or alternatively the number of exponent digits and the shift relative to 0 (bias).

Example: For IEEE 754 numbers of the type single with the base $b = 2$, the smallest possible exponent is −126 and the largest is 127. The largest floating point number that can be represented in this system is therefore $\max = 1.111\ldots_2 \cdot 2^{127} \approx 3.4 \cdot 10^{38}$, and the smallest positive normalized floating point number is $\operatorname{minpos} = 1.000\ldots_2 \cdot 2^{-126} \approx 1.175 \cdot 10^{-38}$. These values, $\max$ and $\operatorname{minpos}$, describe the permissible range of values.
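The single precision figures quoted above can be recomputed from the parameters. The following Python sketch is an illustration only; `single_from_bits` is a hypothetical helper, and the three bit patterns are the standard encodings of the largest single, the smallest normalized single and the successor of 1.0.

```python
import struct

def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

p, e_min, e_max = 23, -126, 127

largest = (2 - 2.0 ** -p) * 2.0 ** e_max   # 1.111...1 (binary) * 2^127
minpos = 2.0 ** e_min                      # 1.000...0 (binary) * 2^-126
epsilon = 2.0 ** -p                        # gap between 1.0 and the next single

print(largest, minpos, epsilon)
print(largest == single_from_bits(0x7F7FFFFF))
print(minpos == single_from_bits(0x00800000))
print(epsilon == single_from_bits(0x3F800001) - 1.0)
```

Python floats are doubles, so all three single precision values are represented exactly here and the comparisons are exact.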

### Normalization

${\color{red}2.}997\,924\,58 \cdot 10^{8}$

The representation of a floating point number is initially not unique. The number 2 can be written as $2.0 \cdot 10^{0}$ or as $0.2 \cdot 10^{1}$.

To force a unique representation, normalized floating point numbers are therefore often used, in which the mantissa is brought into a defined range. Two obvious normalization conditions are $1/b \leq m < 1$ and $1 \leq m < b$. According to the first rule, the number 2 would be written as $0.2 \cdot 10^{1}$; the representation $2.0 \cdot 10^{0}$ would then not be allowed. Calculating with normalized numbers is easier, which is why in the past some implementers of floating point arithmetic allowed only normalized numbers. However, the number 0 cannot be represented in normalized form.

With respect to the usual base 10, the following notations are distinguished:

• Scientific notation with consistent normalization to $1 \leq m < 10$
  • Example: 10000 = 1e4
• Engineering notation with normalization to $1/f \leq m < 1000$, with $f$ as the power of the number of remaining significant digits of the measurement uncertainty relative to the computational accuracy (denormalized digits). Only multiples of 3 appear in the exponent. When calculating with units of measurement, this representation can easily be converted into unit prefixes and into digit grouping with thousands separators, or generated from them.
  • Example: 10,000 m = 10e3 m = 10 km
  • Significance: 10.00e3 m = 10,000 ± 5 m (4 significant digits for a measurement in kilometers with rounding), but 0.01e6 m = 10,000 ± 5000 m (2 significant digits for a measurement in Mm), with the precise information on standard and expanded measurement uncertainty following DIN 1319-3 or the ISO/BIPM Guide (GUM, ENV 13005)
• IEEE 754 (floating point numbers for microprocessors) uses the normalization condition $1 \leq m < 2$ for normalized numbers and additionally allows denormalized (subnormal) numbers between 0 and minpos.

### Representation of the exponent sign with or without bias

In floating point systems, the exponent $e$ is a signed number. This requires additional signed integer arithmetic for exponent calculations. This extra effort can be avoided if a fixed number $B$, the bias or excess, is added to the exponent $e$ and the sum $E = e + B$ is stored instead of the exponent. This sum is then an unsigned, non-negative number. The use of a bias $B$ is usually combined with representing the number 0 by $E = 0$.

An alternative that is seldom encountered today is the representation of the exponent in two's complement, in one's complement or as a sign-magnitude number.

The advantage of the biased representation is that it makes it easier to compare the magnitudes of two positive floating point numbers. It suffices to compare the digit sequences em, i.e. each exponent e followed by its mantissa m, lexicographically. A floating point subtraction followed by a comparison with zero would be far more complex. The disadvantage of the biased representation compared to the two's complement representation is that after adding two biased exponents, the bias has to be subtracted to obtain the correct result.

IEEE 754 uses the representation with B = 127 for single and B = 1023 for double.
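The biased exponent and the lexicographic comparison can be observed directly on the bit patterns. The Python sketch below is illustrative; the helper names `single_bits` and `exponent_field` are assumptions, and the values are restricted to positive numbers, for which the ordering argument applies.

```python
import struct

BIAS = 127  # bias B of IEEE 754 single

def single_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def exponent_field(x: float) -> int:
    """Stored (biased) exponent E = e + B."""
    return (single_bits(x) >> 23) & 0xFF

values = [3.0e10, 1.5, 0.75, 1024.0, 2.0, 1.0]
for v in values:
    E = exponent_field(v)
    print(v, "E =", E, "e =", E - BIAS)

# For positive numbers, sorting by bit pattern (exponent field followed
# by mantissa field) gives the same order as sorting by value.
assert sorted(values, key=single_bits) == sorted(values)
```

Because the exponent field precedes the mantissa field and both are unsigned, a plain integer comparison of the patterns is enough for positive values.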

### Sign of the number

The sign $v$ of a floating point number (+ or −; also +1 or −1) can always be encoded in one bit. Usually, the bit $S = 0$ is used for positive numbers (+) and the bit $S = 1$ for negative numbers (−). Mathematically, one can write $v = (-1)^{S}$.

### Brief description of the parameters

In recent years, the following shorthand for the essential parameters $b$, $p$, $r$ and $B$ of a floating point number has become established. One writes the quantities 1, $r$, $p$, $B$ and $b$ in that order, separated by dots. The 1 is the number of sign bits. An IEEE 754 single number with 1 sign bit, 8 exponent bits and 23 mantissa bits is therefore a 1.8.23.127.2 floating point number. If the base $b$ and the bias $B$ are clear from the context, both can be omitted and one speaks of a 1.8.23 floating point number.

A second common notation omits the sign bit and only specifies the length of the mantissa and the length of the exponent: s23e8.

With these notations the following applies to IEEE-754 numbers:

    half:   1.5.10.15.2,       1.5.10          or s10e5
    single: 1.8.23.127.2,      1.8.23          or s23e8
    double: 1.11.52.1023.2,    1.11.52         or s52e11
    oct:    1.19.236.262143.2, 1.19.236.262143 or s236e19


### Hidden bit

One bit can be saved when representing normalized mantissas in the binary system. Since the first digit of a normalized number is always different from 0, this digit is always equal to 1 in the binary system. A digit with the fixed value 1 no longer has to be stored explicitly because it is implicitly known. An implementation that does this is said to use a hidden bit. The IEEE format for floating point numbers mentioned above makes use of this saving, but the internal 80-bit format of Intel CPUs does not.

For example, if the number 5.25 is to be converted into a short real (single precision number) according to IEEE 754, the radix point is shifted two places to the left after the intermediate step of binary conversion to 101.01, which gives the normalized binary representation 1.0101 · 2^2. Because of the hidden bit, only the sequence 0101 is transferred to the 23-digit mantissa. The use of a hidden bit, however, requires a separate representation of zero, since every mantissa represents a value greater than 0 due to the hidden bit.
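The 5.25 example can be checked by looking at the stored bit pattern. This is a small Python sketch using the standard `struct` module; the variable names are illustrative.

```python
import struct

bits = struct.unpack(">I", struct.pack(">f", 5.25))[0]
pattern = f"{bits:032b}"

sign, exponent, mantissa = pattern[0], pattern[1:9], pattern[9:]
print(sign, exponent, mantissa)

# The stored mantissa begins with 0101; the leading 1 of 1.0101 is the
# hidden bit and does not appear in the pattern.
print(int(exponent, 2) - 127)  # unbiased exponent
```

The exponent field holds the biased value, so subtracting 127 recovers the shift of two places.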

## Properties of a floating point arithmetic

Floating point numbers hold some surprises, especially for the mathematician, and these often influence the results of pocket calculators and computer calculations. Most importantly, some familiar mathematical rules no longer hold. Anyone who works intensively with a calculation aid must know these properties. They are due to the limited accuracy with which the mantissa and exponent are stored. The consequence of this limitation becomes clear when one considers that the infinitely many real numbers are to be represented by a finite number of digit combinations. The floating point numbers in the domain of a system can be thought of as a long table of discrete values. A floating point function then assigns to each value in this list another value from the list. The same applies to operations with two or more operands. The corresponding value ranges are shown graphically in the article Minifloats.

This results in slight to total inaccuracy of the calculations and in the loss of validity of common mathematical rules.

### Cancellation

Cancellation is the effect that when subtracting numbers of almost the same size, the result is wrong.

Example:

If you subtract the number 3.141 from $\pi = 3.141592653589793\ldots$ in a four-digit floating point arithmetic ($b = 10$, $p = 4$), the unsuspecting layperson expects the correctly rounded result $\pi - 3.141 = 0.000592653589793\ldots \approx 5.927 \cdot 10^{-4}$.

In fact, the result is $1.000 \cdot 10^{-3}$: the four-digit rounded value of $\pi = 3.141592653589793\ldots$ is $3.142$, so the result of the calculation becomes $3.142 - 3.141 = 0.001 = 1.000 \cdot 10^{-3}$. This result comes about because the input values, here $\pi$, are already represented in the floating point arithmetic and are therefore not available exactly.
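The four-digit arithmetic of this example can be imitated with Python's `decimal` module. The sketch below assumes a context with precision 4 as a stand-in for $b = 10$, $p = 4$; the variable names are illustrative.

```python
from decimal import Context, Decimal

ctx = Context(prec=4)                        # b = 10, p = 4

pi_exact = Decimal("3.141592653589793")
pi_rounded = ctx.create_decimal(pi_exact)    # input is rounded to 4 digits
other = ctx.create_decimal("3.141")

computed = ctx.subtract(pi_rounded, other)   # what the 4-digit machine gets
wanted = ctx.plus(pi_exact - other)          # exact difference, rounded once

print(pi_rounded, computed, wanted)
```

The subtraction itself is exact here; the error was introduced earlier, when $\pi$ was rounded to four digits.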

### Numbers of various magnitudes (absorption)

The addition or subtraction of a number that is much smaller in amount does not change the larger number.

In the example of four-digit decimal arithmetic ($b = 10$, $p = 4$), adding 0.001 to 100 does not change the larger operand. The same goes for subtraction:

• $1.000 \cdot 10^{2} + 1.000 \cdot 10^{-3} = 1.000 \cdot 10^{2} + 0.000|01 \cdot 10^{2} = 1.000 \cdot 10^{2} + 0.000 \cdot 10^{2} = 1.000 \cdot 10^{2}$

(The digits behind the line | are omitted when scaling)
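Absorption can be reproduced both in the four-digit decimal model and with ordinary binary doubles. The Python sketch below is an illustration; the precision-4 `decimal` context stands in for $b = 10$, $p = 4$, and the double example uses a value whose neighbors are 2 apart.

```python
from decimal import Context, Decimal

ctx = Context(prec=4)
big, small = Decimal("100"), Decimal("0.001")

print(ctx.add(big, small))        # still 100 after rounding to 4 digits
print(ctx.subtract(big, small))   # also rounds back to 100

# The same effect with binary doubles: near 1e16 the spacing is 2.
x = 1.0e16
print(x + 1.0 == x)
```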

### Underflow

Since there is a smallest positive number in the floating point representation, below which no value can be represented, a result in this area is usually represented by 0. In this case one speaks of an underflow. If it is an interim result, all information about the result has been lost at this point. In some cases the accuracy of the end result is not affected, but in other cases the resulting end result can be completely wrong.
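A short Python sketch (doubles, so the limits differ from the single format discussed earlier) shows how an underflow in an intermediate result can destroy the final result, while a different order of evaluation avoids it. The variable names are illustrative.

```python
import sys

print(sys.float_info.min)        # smallest positive normalized double

a = 1e-200
lost = (a * a) * 1e200           # a * a underflows to 0.0 first
kept = a * (a * 1e200)           # intermediate result stays near 1.0

print(a * a, lost, kept)
```

Mathematically both expressions equal $10^{-200}$; only the second evaluation order preserves that value.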

### Invalidity of the associative and distributive laws

Addition and multiplication of floating point numbers are not associative, that is, in general:

• $(x + y) + z \neq x + (y + z)$
• $(x \cdot y) \cdot z \neq x \cdot (y \cdot z)$

Addition and multiplication of floating point numbers are also not distributive, which means in general:

• $x \cdot (y + z) \neq (x \cdot y) + (x \cdot z)$
• $(x + y) \cdot z \neq (x \cdot z) + (y \cdot z)$
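Concrete counterexamples are easy to find with binary doubles. The values in this Python sketch were picked for illustration; other operands may well satisfy the laws by coincidence.

```python
# Associativity of addition
x, y, z = 0.1, 0.2, 0.3
print((x + y) + z == x + (y + z), (x + y) + z, x + (y + z))

# Associativity of multiplication (the left side overflows to infinity)
big = 1e200
print((big * big) * 1e-200 == big * (big * 1e-200))

# Distributivity
a, b, c = 100.0, 0.1, 0.2
print(a * (b + c) == a * b + a * c, a * (b + c), a * b + a * c)
```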

### Solvability of equations

In floating point arithmetic, some normally unsolvable equations have a solution. This effect is even used to describe such a floating point system.

Example:

The equation $(1 + x) = 1$ has no solution $x \neq 0$ in the real numbers.

In floating point arithmetic, this equation has many solutions, namely all numbers that are too small to have any effect on the sum. Again with the example of four-digit decimal fractions ($b = 10$, $p = 4$), the following applies (the bar | marks the digits dropped in the addition):

• 1 + 1e−3 = 1.000 + 0.001|000000… = 1.000 + 0.001 = 1.001 > 1
• 1 + 1e−4 = 1.000 + 0.000|10000… = 1.000 + 0.000 = 1.000 = 1
• 1 + 2.3e−5 = 1.000 + 0.000|023000… = 1.000 + 0.000 = 1.000 = 1

The smallest number $\epsilon$ mentioned above that can be added to one to give a result different from one ($1 + \epsilon > 1$; $\epsilon$ minimal) is called the machine precision.
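For binary doubles, the same experiment can be run in Python. Note one assumption about terminology: `sys.float_info.epsilon` is defined as the gap between 1.0 and the next larger double, which is closely related to, but not literally the same as, the "smallest number that changes the sum" used in the text.

```python
import sys

eps = sys.float_info.epsilon
print(eps == 2.0 ** -52)
print(1.0 + eps > 1.0)
print(1.0 + eps / 4 == 1.0)

# Nonzero "solutions" of 1 + x == 1 among a few candidates
candidates = (1e-3, 1e-16, 1e-17, 2.3e-20)
solutions = [x for x in candidates if 1.0 + x == 1.0]
print(solutions)
```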

### Conversions

If the base is not 10, the numbers must be converted between the floating point system in use and the decimal system in order to obtain a human-readable representation. This conversion is usually programmed quickly (and often imprecisely). A long-standing and important requirement for this conversion is that it be reversible bit for bit: a result presented in the decimal system should be able to be read in again and reproduce exactly the same bit pattern in the floating point system.

This requirement is often ignored. An exception is Java, which observes the following theorem:

Theorem: One can show that it is not sufficient to round up the number of decimal places computed from the mantissa precision and to output that many rounded decimal places. However, one additional digit is sufficient (Theorem 15). This is the reason why an additional and apparently superfluous digit always appears in the representation of real numbers produced by Java programs.
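The round-trip requirement can be tried out in Python, whose floats are IEEE 754 doubles. The sketch assumes the commonly cited digit counts for doubles: 15 significant decimal digits are not always enough, while 17 are.

```python
x = 0.1 + 0.2

short = format(x, ".15g")   # 15 significant digits
long = format(x, ".17g")    # 17 significant digits

print(short, float(short) == x)
print(long, float(long) == x)

# repr() produces the shortest string that still round-trips.
print(repr(x), float(repr(x)) == x)
```

Here the 15-digit output reads back as a neighboring double, so the original bit pattern is lost.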

### Decimal fractions

Even simple decimal fractions such as 0.1 cannot be represented exactly as binary floating point numbers, since every rational number whose denominator in lowest terms is not a power of two leads to a non-terminating, periodic representation in the binary system. Only the first $p$ digits of this are stored, which results in inaccuracy. Decimal 0.1 is binary 0.0001100110011… However, for binary floating point systems with appropriate rounding rules, it has been proven that the representation of 0.1 multiplied by 10 gives exactly 1 again. In general, if the rounding is correct, (m / 10) · 10 = m (Goldberg's Theorem 7 for the specific case n = 2^1 + 2^3 = 10).

In disciplines like financial mathematics, results are often required that exactly match a manual calculation. This is only possible with decimal floating point arithmetic or, with some contortions, with fixed point arithmetic.
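Both observations can be illustrated in Python: the exact value stored for 0.1 in a binary double, and the contrast with decimal arithmetic. The `fractions` and `decimal` modules are used here only to make the stored values visible.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.1)              # exact rational value of the double 0.1
print(stored, stored == Fraction(1, 10))

print(0.1 * 10 == 1.0)              # correct rounding restores exactly 1
print(0.1 * 3 == 0.3)               # but not every product is that lucky

print(Decimal("0.1") * 3 == Decimal("0.3"))   # decimal arithmetic is exact here
```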

### Check for equality

The restriction mentioned in the section on decimal fractions, that many decimal numbers cannot be represented exactly in the binary system of a computer, has an impact on comparisons of the form $x = y$ between floating point numbers when programming. An example in the C language should make this clear:

    #include <stdio.h>

    int main(void) {
        /* mathematically equal, but not in binary floating point */
        if (0.362 * 100.0 != 36.2)
            puts("different");

        if (0.362 * 100.0 / 100.0 != 0.362)
            puts("also different");
        return 0;
    }


Although the two equations $0.362 \cdot 100 = 36.2$ and $0.362 \cdot 100 / 100 = 0.362$ are mathematically correct, they become false because of the inexact conversion into the computer's binary system. In the example program, both inequalities are therefore evaluated as true.

Comparisons must therefore be replaced by a check of whether the values to be compared can be regarded as equal within an achievable accuracy (usually called the tolerance $\varepsilon$).

If one tolerates an absolute error in the comparison, one possible formulation is $|x - y| \leq \varepsilon$.

If one tolerates a relative error in the comparison, one possible formulation is $\left|1 - \tfrac{y}{x}\right| \leq \varepsilon$. The second case usually has to be combined with a check for the special case $x \neq 0$.
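A tolerance-based comparison might look like the following Python sketch. The function `nearly_equal` and its default tolerances are assumptions for illustration; it uses a division-free variant of the relative test, so the special case $x = 0$ needs no separate branch. The standard library offers `math.isclose` with comparable parameters.

```python
import math

def nearly_equal(x: float, y: float,
                 rel_tol: float = 1e-9, abs_tol: float = 0.0) -> bool:
    """True if x and y agree within an absolute or a relative tolerance."""
    if x == y:
        return True
    diff = abs(x - y)
    # |x - y| <= eps (absolute) or |x - y| <= eps * max(|x|, |y|) (relative)
    return diff <= abs_tol or diff <= rel_tol * max(abs(x), abs(y))

print(0.1 + 0.2 == 0.3)                       # exact comparison fails
print(nearly_equal(0.1 + 0.2, 0.3))           # tolerant comparison succeeds
print(math.isclose(0.1 + 0.2, 0.3))           # standard library equivalent
print(nearly_equal(0.0, 1e-12, abs_tol=1e-9)) # near zero, use abs_tol
```

A purely relative test never accepts a nonzero value as equal to zero, which is why an absolute tolerance is offered as well.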

Alternatively, in the case of rational floating point numbers $n_{1} n_{2} \dots n_{m}$, all factors or summands of those problematic comparisons, including the expected result, can be multiplied by $10^{\left(n_{m}\right)}$, where the index $m$ denotes the last decimal place. More generally: all floating point numbers should, if possible, be converted into fractions. These can in turn be converted into the binary number system uniquely and without rounding the mantissa. Compilers of some programming languages (e.g. Java, Ruby, C++, Objective-C, Swift, Rust, etc.) can calculate directly with the fractions and compare the fractions in the conditional statements (if statements) mentioned above, whose branches are then not entered. Other programming languages (e.g. Object Pascal, PHP, JavaScript, Perl, Python, etc.), in turn, convert the fraction or quotient back into a binary number as the very first step and only then compare the two values, which in this case makes both conditions true and produces the outputs shown above.
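The idea of comparing fractions instead of binary floating point values can be sketched explicitly with Python's `fractions` module. This illustrates the principle only; it makes no claim about what any particular compiler does.

```python
from fractions import Fraction

a = Fraction("0.362")            # exactly 362/1000, no binary rounding
print(a * 100 == Fraction("36.2"))
print(a * 100 / 100 == a)

# Scaling by a power of ten turns the same comparison into integer arithmetic:
# 0.362 * 100 == 36.2  becomes  362 * 100 == 36200 after multiplying by 10**3.
print(362 * 100 == 36200)
```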

Even numbers with exactly the same bit pattern, and thus actually identical values, are sometimes not considered equal ($x = y$) by the computer. The reason for this is that the formats in memory (e.g. Intel 64 bit) and during a calculation in the floating point unit (e.g. Intel 80 bit) are sometimes not identical. If the bit patterns to be compared come once from memory, and are thus rounded, and once from the FPU, and thus have full accuracy, the comparison gives the wrong result. The remedy is the same as already described. This problem can also arise with greater-than and less-than comparisons. Depending on the language and architecture used, special instructions and/or a detour via main memory are needed to solve this.

### Hidden use of other representations

Some computing systems use several different formats when computing. With Intel and related processors such as AMD, the FPU calculates with an 80-bit format. The numbers are stored in an IEEE 754 compatible 64-bit or 32-bit format. When using MMX/SSE extensions, other calculation formats are used. This leads to further properties that are at first very opaque to laypeople. A simple equality comparison of identical bit patterns can yield the result that the apparently identical bit patterns are different. The following program sometimes produces paradoxical results when called with the same value for x and y:

    #include <math.h>
    #include <stdio.h>

    void vergleiche(double x, double y) {
        if (x != cos(y))
            printf("paradox\n");
        else
            printf("what everyone would expect\n");
    }
    ...
    double z = 0.2; // the value is not important
    vergleiche(cos(z), z);

    with the result:  paradox


The explanation for this behavior is that the compiler generates two independent cos calculations, one before the call to `vergleiche` and the other inside `vergleiche`. The variable x receives cos(z) with 64 bits. On Intel, cos(y) can be calculated with 80 bits; the two results differ if cos(y) is compared with the 64-bit variable x not from memory but directly from the 80-bit working register.

## Binary floating point numbers in digital technology

The above examples are all given in the decimal system, i.e. with a base b = 10. Computers use the binary system with a base b = 2 instead.

### Single and double precision

In computers, floating point numbers are normally represented as sequences of 32 bits (single precision) and 64 bits (double precision).

Some processors also allow longer floating point numbers. For example, the processors derived from the familiar Intel x86 series (including Intel Pentium and AMD Athlon) provide a floating point representation with 80 bits for intermediate results. Some systems also allow floating point numbers with 128 bits (quadruple precision). Some older systems also used other lengths such as 36 bits (IBM 704, PDP-10, UNIVAC 1100/2200 series), 48 bits (Burroughs) or 60 bits (CDC 6600).

In addition, there are systems known as minifloats with very few bits (for example 8 or 16) that are used in low-memory systems (controllers) or limited data streams (e.g. graphics cards).

### IEEE 754 and other standards

The most common and best-known floating point system today was conceived by the IEEE in 1985, laid down in IEEE 754, and is available in most computers as hardware or software arithmetic. IEEE 854 is a standard for decimal floating point numbers or decimal fractions. Both standards are merged and expanded in the IEEE 754r revision.

The IEEE has regulated the representation of floating point numbers in its IEEE 754 standard since 1985; almost all modern processors follow this standard. Counterexamples that do not meet the IEEE 754 specification are some IBM mainframe systems (hexfloat format), the VAX architecture and some supercomputers such as those from Cray. The Java language is based closely on IEEE 754, but does not completely meet the standard.

The definition of the hexfloat format from IBM can be found in the book "Principles of Operation" of the z/Architecture.

The Power6 from IBM is one of the first processors to implement decimal floating point arithmetic in hardware; the base is therefore 10. In the following, only base 2 is dealt with.

Strictly speaking, only the normalized numbers of IEEE 754 are floating point numbers. The denormalized numbers are actually fixed-point numbers; these special cases were created for special numerical purposes.

### Internal representation

The actual representation in the computer therefore consists of a sign bit, some exponent bits and some mantissa bits. The mantissa is usually normalized and represents numbers in the interval [1, 2). (Since the first bit, with the value one, is always set in this interval, it is usually assumed implicitly and not stored, see Hidden bit.) The exponent is usually stored in biased format or in two's complement. Furthermore, some exponent values, for example the largest and the smallest possible exponent, are usually reserved to represent special values (zero, infinity, not a number).

A number f is therefore represented as f = s · m · 2^e, where s is 1 or −1.

| Type | IEEE 754 mantissa (bits) | IEEE 754 exponent (bits) | S/390 mantissa (bits) | S/390 exponent (bits) |
|---|---|---|---|---|
| Half | 10 | 5 | | |
| Single | 23 | 8 | 24 | 7 |
| Double | 52 | 11 | 56 | 7 |
| Extended | not exactly defined | not exactly defined | 112 | 7 |

### Limitations and their mathematical basis

The different binary representation of the numbers can lead to artifacts in both systems. That means: rational numbers that appear "round" in the decimal system, for example $\tfrac{249}{20} = 12.45$, cannot be represented exactly in the binary system (the value is $1100.01\overline{1100}_{2}$). Instead, their binary representation is rounded within the respective calculation accuracy, so that when converting back to the decimal system one obtains, for example, the value 12.44999999900468785. This can lead to unforeseen rounding errors, up or down, in subsequent calculations.

The artifacts mentioned above are inevitable in the binary system, since infinitely many numbers that can be represented exactly in the decimal system are periodic numbers with infinitely many digits after the radix point in the binary system. They could only be avoided by using encodings with base 10 (or other bases of the form $10 \cdot n$ with any $n \in \mathbb{N}$), see also BCD code. However, binary floating point numbers are still used for a variety of reasons.

In general, for every base d there are infinitely many rational numbers that have a finite representation (period 0) in another base and an infinite, periodic representation in base d. The only distinguishing feature of the decimal system here is that people are used to it, which is why the decimal system is often preferred as the input and output format of calculations.

In mathematics, a floating point number system is a tuple $\left(d, [e_{\mathrm{min}}, e_{\mathrm{max}}], l\right)$, where $d$ is the base, $\left[e_{\mathrm{min}}, e_{\mathrm{max}}\right]$ the range of the exponent and $l$ the length of the mantissa.

A real number x ≠ 0 can thus be represented by an a and an e such that $x = a d^{e}$ and $e \in \left[e_{\mathrm{min}}, e_{\mathrm{max}}\right]$, with $a = \sum_{i=1}^{l} a_{i} \cdot d^{-i}$.

This enables a mathematical consideration of the rounding error. The above representation realizes a projection

$fl \colon \mathbb{R} \to \{x \in \mathbb{R} \mid \exists a, e \colon x = a d^{e}\}$

and so the rounding error is defined as

$\frac{|x - fl(x)|}{|x|} \leq \varepsilon := \frac{1}{2} d^{1-l}.$

For double values, $\varepsilon$ equals $2^{-53}$ (approximately $1.1 \cdot 10^{-16}$).
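The bound can be checked numerically for a few rational numbers. In this Python sketch, `relative_rounding_error` is a hypothetical helper; it relies on Python floats being IEEE 754 doubles and on the conversion from `Fraction` to `float` rounding to the nearest double.

```python
from fractions import Fraction

def relative_rounding_error(x: Fraction) -> Fraction:
    """Exact value of |x - fl(x)| / |x| for double precision."""
    fl_x = Fraction(float(x))    # nearest double, converted back exactly
    return abs(x - fl_x) / abs(x)

bound = Fraction(1, 2 ** 53)     # epsilon = 1/2 * 2^(1 - 53)

for x in (Fraction(1, 10), Fraction(249, 20), Fraction(1, 3)):
    err = relative_rounding_error(x)
    print(x, float(err), err <= bound)
```

All arithmetic on the errors is done with exact fractions, so the comparison against the bound is not itself subject to rounding.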

### Example: Calculation of floating point number

The number $18.4_{10}$ is to be converted into a floating point number; we use the IEEE single format (IEEE 754, binary32).

1. Calculation of the excess

(The excess or bias is a constant belonging to the number format. It depends on the number of bits reserved for the exponent in the number representation, i.e. 8 bits in IEEE 754 single.)

    Excess = 2^(n-1) - 1
             (n = number of exponent bits in the number representation)
           = 2^(8-1) - 1
           = 2^7 - 1
           = 128 - 1
           = 127


2. Conversion of the decimal fraction into an unsigned binary fixed point number

Floating point number = 18.4

Integer part = 18
18 / 2 = 9 remainder 0 (least significant bit)
9 / 2 = 4 remainder 1
4 / 2 = 2 remainder 0
2 / 2 = 1 remainder 0
1 / 2 = 0 remainder 1 (most significant bit)
= 10010

Fractional part = 0.4
0.4 * 2 = 0.8 → 0 (most significant bit)
0.8 * 2 = 1.6 → 1
0.6 * 2 = 1.2 → 1
0.2 * 2 = 0.4 → 0
0.4 * 2 = 0.8 → 0
0.8 * 2 = 1.6 → 1 (least significant bit shown)
...
= 0.0110011001100110011...

18.4 = 10010.011001100110011...


3. Normalize

10010.01100110011... * 2^0 = 1.001001100110011... * 2^4


4. Calculation of the binary exponent

since 2^4 → exponent = 4
exponent + excess
4 + 127 = 131

131 / 2 = 65 remainder 1 (least significant bit)
65 / 2 = 32 remainder 1
32 / 2 = 16 remainder 0
16 / 2 = 8 remainder 0
8 / 2 = 4 remainder 0
4 / 2 = 2 remainder 0
2 / 2 = 1 remainder 0
1 / 2 = 0 remainder 1 (most significant bit)
= 10000011


5. Determine the sign bit

The sign follows from the factor (-1)^s:

positive → s = 0
negative → s = 1
here: s = 0


6. Form the floating point number

1 sign bit + 8 exponent bits + 23 mantissa bits
0 10000011 00100110011001100110011
→ the leading 1 before the radix point is omitted as the hidden bit;
since there is always a 1 in that position,
it does not need to be stored
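The six steps can be retraced in a few lines of Python. This is only a sketch under stated assumptions: `binary32_by_hand` (a name invented here) handles values with magnitude of at least 1 and truncates the mantissa instead of rounding, which happens to give the correctly rounded result for 18.4 because the first discarded bit is 0. The standard `struct` module is used as an independent cross-check.

```python
import struct

def binary32_by_hand(x: float) -> str:
    """Steps 1 to 6 from the text; assumes |x| >= 1 and truncates the mantissa."""
    sign = "1" if x < 0 else "0"            # step 5
    x = abs(x)
    int_part = int(x)
    frac = x - int_part
    int_bits = bin(int_part)[2:]            # step 2, repeated division by 2
    frac_bits = ""
    for _ in range(40):                     # step 2, repeated multiplication by 2
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit
    exponent = len(int_bits) - 1            # step 3, shift the point behind the leading 1
    excess = 2 ** (8 - 1) - 1               # step 1, 127
    characteristic = f"{exponent + excess:08b}"   # step 4
    mantissa = (int_bits[1:] + frac_bits)[:23]    # step 6, hidden bit dropped
    return f"{sign} {characteristic} {mantissa}"

def binary32_via_struct(x: float) -> str:
    """Same bit pattern as produced by the platform's IEEE 754 conversion."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(binary32_by_hand(18.4))     # 0 10000011 00100110011001100110011
print(binary32_via_struct(18.4))  # 0 10000011 00100110011001100110011
```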


### Calculation of an IEEE single precision floating point number (32-bit floating point number)

This section presents the exact calculation steps for converting a decimal fraction into a binary floating point number of the type single according to IEEE 754. To do this, the three values that make up the number ${\displaystyle x}$, namely the sign ${\displaystyle v}$ (1 bit), the mantissa ${\displaystyle m}$ and the exponent ${\displaystyle e}$, must be calculated one after the other:

${\displaystyle x=(-1)^{v}\cdot m\cdot 2^{e}}$

sign

Depending on whether the number is positive or negative, ${\displaystyle v}$ is 0 or 1: ${\displaystyle (-1)^{0}=1}$ or ${\displaystyle (-1)^{1}=-1}$.

All further calculations are made with the absolute value of the number.

exponent

Next, the exponent is determined. With the IEEE single data type, 8 bits are provided for it. The exponent must be chosen so that the mantissa has a value of at least 1 and less than 2:

${\displaystyle e=\left\lfloor \log _{2}(|x|)\right\rfloor }$

If the result is a value for the exponent that is less than −126 or greater than 127, the number cannot be stored in normalized form with this data type. Instead, disregarding denormalized numbers, it is stored as 0 (zero) or as "infinity".

However, the value of the exponent is not stored directly; it is increased by a bias value to avoid negative values. With IEEE single, the bias is 127. This means that the exponent values −126 ... +127 are stored as a so-called "characteristic" in the range 1 ... 254. The characteristic values 0 and 255 are reserved for the special values "zero", "infinity" and "NaN".

mantissa

The mantissa is now stored in the remaining 23 bits:

${\displaystyle m=\left({\frac {|x|}{2^{e}}}-1\right)\cdot 2^{23}}$

Numerical example with the number 11.25

Number = +11.25

Sign = + → 0 binary

${\displaystyle {\text{Exponent}}=\left\lfloor \log _{2}(11.25)\right\rfloor =\left\lfloor 3.49\right\rfloor =3}$ → 3 + 127 = 130 → 10000010 binary

${\displaystyle {\text{Mantissa}}=\left({\frac {11.25}{2^{3}}}-1\right)\cdot 2^{23}=(1.40625-1)\cdot 2^{23}=3407872}$ → 01101000000000000000000 binary

This results in the following floating point number of single precision:

0 10000010 01101000000000000000000
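The three formulas for sign, exponent and mantissa translate almost literally into code. The sketch below is an illustration rather than a complete encoder: it assumes a nonzero number in the normalized range and ignores zero, denormalized numbers, infinities, NaN and a carry out of the mantissa when rounding. The names `encode_single` and `as_bit_string` are invented for this example.

```python
import math
import struct

BIAS = 127  # bias of IEEE single, as stated in the text

def encode_single(x: float):
    """Return (sign bit, characteristic, 23-bit mantissa field) for a normalized x."""
    v = 1 if x < 0 else 0
    # frexp yields |x| = f * 2**k with 0.5 <= f < 1, hence floor(log2|x|) = k - 1.
    # This avoids rounding problems of log2 close to powers of two.
    _, k = math.frexp(abs(x))
    e = k - 1
    m = round((abs(x) / 2 ** e - 1) * 2 ** 23)
    return v, e + BIAS, m

def as_bit_string(v: int, characteristic: int, m: int) -> str:
    """Format the three fields as sign, 8 exponent bits and 23 mantissa bits."""
    return f"{v} {characteristic:08b} {m:023b}"

fields = encode_single(11.25)
print(fields)                  # (0, 130, 3407872)
print(as_bit_string(*fields))  # 0 10000010 01101000000000000000000

# Cross-check against the conversion performed by the struct module.
(word,) = struct.unpack(">I", struct.pack(">f", 11.25))
print(f"{word:032b}")
```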

reversal

To calculate the decimal value of a floating point number stored in a 32-bit machine word, the following formula can be used, where VZ is the sign bit, M the 23-bit mantissa field read as an integer and E the stored characteristic:

${\displaystyle Z=(-1)^{VZ}\cdot (1.0+M/2^{23})\cdot 2^{E-127}}$
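As a minimal sketch, the reversal formula can be applied to a 32-bit word with a few shifts and masks. It assumes a normalized number, that is a characteristic between 1 and 254; the function name `decode_single` is chosen here for illustration.

```python
import struct

def decode_single(word: int) -> float:
    """Apply Z = (-1)**VZ * (1.0 + M / 2**23) * 2**(E - 127) to a 32-bit word."""
    vz = word >> 31            # sign bit
    e = (word >> 23) & 0xFF    # 8-bit characteristic
    m = word & 0x7FFFFF        # 23-bit mantissa field
    return (-1) ** vz * (1.0 + m / 2 ** 23) * 2.0 ** (e - 127)

word = 0b0_10000010_01101000000000000000000  # the pattern computed for 11.25
print(decode_single(word))                   # 11.25

# Cross-check: let struct interpret the same four bytes as a binary32 value.
(reference,) = struct.unpack(">f", word.to_bytes(4, "big"))
print(reference == decode_single(word))      # True
```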

### Calculation of an IEEE double precision floating point number (64-bit floating point number)

Reversal:

The following formula can be used to calculate a decimal number from a floating point number in the machine word (64 bit):

${\displaystyle Z=(-1)^{VZ}\cdot (1.0+M/2^{52})\cdot 2^{E-1023}}$

Example:

The following 64-bit binary number should be interpreted as a floating point number:

0 10001110100 0000101001000111101011101111111011000101001101001001

(the leftmost bit is bit 63 and the rightmost bit is bit 0)

Bit 63 represents the sign (1 bit), i.e.

  VZ = 0 (binary) = 0


Bits 62 to 52 represent the exponent (11 bits), so:

 E = 10001110100 (binary) = 1140


Bits 51 to 0 represent the mantissa (52 bits), i.e.:

  M = 0000101001000111101011101111111011000101001101001001 (binary) = ${\displaystyle 180847918207817}$


Inserting these values into the formula gives (rounded):

 ${\displaystyle Z=(-1)^{0}\cdot (1.0+180847918207817/2^{52})\cdot 2^{1140-1023}=1\cdot 1.040156304549969\cdot 2^{117}\approx 1.7282561\cdot 10^{35}}$
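The example can be reproduced by splitting the bit string into its three fields and applying the 64-bit formula. The following sketch assumes that Python's `float` is an IEEE 754 double and uses the `struct` module as an independent check; the variable names mirror the symbols VZ, E and M from the text.

```python
import math
import struct

BITS = ("0"
        "10001110100"
        "0000101001000111101011101111111011000101001101001001")

vz = int(BITS[0], 2)      # bit 63
e = int(BITS[1:12], 2)    # bits 62 to 52
m = int(BITS[12:], 2)     # bits 51 to 0
print(vz, e, m)           # 0 1140 180847918207817

z = (-1) ** vz * (1.0 + m / 2 ** 52) * 2.0 ** (e - 1023)
print(z)                  # approximately 1.7282561e35

# Cross-check: reinterpret the same 8 bytes as a binary64 value.
(reference,) = struct.unpack(">d", int(BITS, 2).to_bytes(8, "big"))
print(reference == z)     # True
print(math.isclose(z, 1.7282561e35, rel_tol=1e-6))  # True
```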


## Remarks

1. Namely all rational numbers that can be represented with a power of ten as a denominator, but not with a power of two.