Unum (number format)

Numbers that are represented on a base other than 10 , especially binary , are notated with the base behind them; the base itself is always noted in decimal. So is . The index 2 does not apply if bit patterns or the like. the question is, so the numerical value is not in the foreground. Bit patterns are shown in typewriter font if technically possible.

{\ displaystyle 1011_ {2} = 11_ {10} = 11}

Universal numbers , or Unum for short, are a floating-point number format developed by John L. Gustafson , known through Gustafson's Law , as an alternative to the predominant IEEE ‑ 754 floating point numbers . The first version of Unums is retrospectively referred to as Type I , when Type II and later Type III were released.

"Some people are surprised that π can be represented as an exact number, but of course it can, which is one reason for the term 'universal number.'"

"Some are surprised that π can be represented exactly, but of course it can, and it is one of the reasons for the term universal number ."

- John L. Gustafson : A Radical Approach to Computation with Real Numbers

The main internal difference to IEEE ‑ 754 floating point numbers is the variable format. While the IEEE ‑ 754 standard stipulates a fixed number of exponent and mantissa bits - only depending on the number of bits - these are variable with Unums. All Unum formats have in common that arithmetic overflows and underflows do not degenerate into infinity, but remain at the smallest or largest intervals or numbers. According to Gustafson, the numerical error of sliding from finite results to infinity is infinite, whereas assigning a finite value is potentially a large but finite error.

Numerical model

Representation of the projective straight line as a circle with bottom, top, left and right

{\ displaystyle 0}

{\ displaystyle \ infty}

{\ displaystyle -1}

{\ displaystyle +1}

The mathematical number model behind Type II and Type III Unums is the projective straight line , i.e. the real numbers together with an infinitely distant point that is reached by numbers with a large amount, regardless of whether they are negative or positive. How is unsigned and is reached from the negative and positive sides alike. In the practice of floating point numbers , infinite and undefined values are basically special cases ; certain metrics that indicate relative accuracy, for example, are not meaningfully applicable to this. Unums is used for undefined values. Operations on finite numbers that are mathematically well-defined never produce results. Thus, more corresponds to a quiet NaN from IEEE 754 than its infinite values. For example, if the value range is exceeded by adding two large numbers, IEEE ‑ 754 floating point numbers round off ; In Type I and Type II Unums the result is an object that describes the open interval between the largest number that can be exactly represented and . Type III unums do not represent intervals, they then remain at the largest finite value. ${\ displaystyle \ infty}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$ ${\ displaystyle \ infty}$ ${\ displaystyle \ infty}$ ${\ displaystyle + \ infty}$ ${\ displaystyle + \ infty}$

The number model behind Type I-Unums and IEEE 754, on the other hand, is the closed real line, i.e. the real numbers together with separate values for and . There are also undefined values, including NaN for English. not called a number . IEEE 754 also distinguishes between the numbers and . ${\ displaystyle - \ infty}$ ${\ displaystyle + \ infty}$ ${\ displaystyle +0}$ ${\ displaystyle -0}$

Basically, there is a clear, precisely represented negative for every Unum and a clear, precisely represented reciprocal value for Type II Unums , whereas such a reciprocal value for Type III Unums is only guaranteed for and powers of two. In the usual representation of the projective straight line (see picture), the negation corresponds to the reflection on the vertical axis and the reciprocal value formation on the horizontal axis; especially applies and . ${\ displaystyle x}$ ${\ displaystyle -x}$ ${\ displaystyle x ^ {- 1}}$ ${\ displaystyle 0, \ infty}$ ${\ displaystyle 0 ^ {- 1} = \ infty}$ ${\ displaystyle \ infty ^ {- 1} = 0}$

Type I.

The first format from Unums, called Type I in retrospect , was designed as an extension of the IEEE ‑ 754 floating point numbers. It uses a ubit ( u for unprecise , imprecise) at the end, which indicates whether the number is precise or an open interval between adjacent floating point numbers ( ). The layout is the same as for IEEE ‑ 754 floating point numbers, but the number of exponent and mantissa bits varies automatically or both can be freely selected. Type I Unums are a simple solution to interval arithmetic . ${\ displaystyle \ textstyle {ubit} = 0}$ ${\ displaystyle \ textstyle {ubit} = 1}$

Gustafson developed Type-II-Unums from the experience that the IEEE-754 compatibility is of little use, but causes a lot of effort, whereas the part with interval arithmetic could be implemented efficiently.

Type II

The second version of Unums discards the compatibility with IEEE ‑ 754 floating point numbers in favor of greater flexibility. The main finding was that signed whole numbers can be mapped well on the projective straight line, whereby the overflow is consciously accepted, but the order relation is retained. In particular, in signed integer formats with two's complement representation of negative numbers, the smallest negative number (the one with the maximum amount) has the property of being its own negative. As an integer, this marginal case is occasionally an obstacle, but this property is natural and desirable for the infinitely distant point in the projective straight line.

Like Type I Unums, Type II Unums interpret the last bit in the representation as ubit , which represents an open interval between exact numbers. These intervals can be collapsed at the end of the calculation, for example to their arithmetic or geometric center .

For an n-bit Type II Unum format, real, not necessarily rational, numbers greater than selected and placed in the first quadrant on the projective straight line. The other quadrants result from the formation of negative and reciprocal values. An n-bit word has states, half of which are occupied by intervals and any three of the four quadrants result from the fourth. Since the choice is fixed, there is one option less than . ${\ displaystyle 2 ^ {n-3} -1}$ ${\ displaystyle 1}$ ${\ displaystyle 2 ^ {n}}$ ${\ displaystyle 1}$ ${\ displaystyle 2 ^ {n-3}}$

An example of a decimal 8-bit format: Wanted is the exact representation of the numbers , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , . The numbers given as inverses are wishes for the fourth quadrant, expressed in a way that fits the scheme. ${\ displaystyle 31}$ ${\ displaystyle 10 ^ {100}}$ ${\ displaystyle 10 ^ {50}}$ ${\ displaystyle 10 ^ {20}}$ ${\ displaystyle 10 ^ {10}}$ ${\ displaystyle 10 ^ {6}}$ ${\ displaystyle 10 ^ {4}}$ ${\ displaystyle 1000}$ ${\ displaystyle 500}$ ${\ displaystyle 200}$ ${\ displaystyle 100}$ ${\ displaystyle 80}$ ${\ displaystyle 50}$ ${\ displaystyle 40}$ ${\ displaystyle 25}$ ${\ displaystyle 20}$ ${\ displaystyle 12 {,} 5}$ ${\ displaystyle 10}$ ${\ displaystyle 9}$ ${\ displaystyle 8}$ ${\ displaystyle 7}$ ${\ displaystyle 6}$ ${\ displaystyle 5}$ ${\ displaystyle 4}$ ${\ displaystyle 0 {,} 3 ^ {- 1}}$ ${\ displaystyle 3}$ ${\ displaystyle 2 {,} 5}$ ${\ displaystyle 2}$ ${\ displaystyle 0 {,} 6 ^ {- 1}}$ ${\ displaystyle 0 {,} 7 ^ {- 1}}$ ${\ displaystyle 1 {,} 25}$ ${\ displaystyle 0 {,} 9 ^ {- 1}}$

The operations for Type II Unums can generally only be implemented by a table of values, which makes them slow for bit numbers greater than approx. 20, since tables of values for two-digit operations grow quadratically.

Gustafson developed type III Unums, as type II Unums cumbersome (engl. For fused fused ) operations such as fused multiply-add turned out. The aim of Type III Unums is therefore to implement the advantages of Type II Unums in a software and hardware- friendly manner.

SoRNs

Since the formats of type I and II allow the number ranges to be flexible and application-specific, small bit lengths are often sufficient. In the case of 8 bits, a 256 ‑ bit word can represent any combination of intervals (since 2 ⁸ = 256, see cardinality of the power set ). In this context, the exact numbers are seen as a set of ones with this same number and as the set of pairs , but for the sake of simplicity they are also referred to as an interval. A word interpreted in this way is called SoRN for English. Set of Real Numbers , in German (partial) set of real numbers . The i ‑ th bit indicates whether the interval with the representation is a subset of the SoRN. As the result of an operation, a SoRN means that it contains the mathematically exact solution. Assuming a correct implementation, the statement is true in the mathematical sense and not an approximation with unknown error bounds. ${\ displaystyle \ infty}$ ${\ displaystyle \ {- \ infty, + \ infty \}}$ ${\ displaystyle i}$

SoRNs are usually specified as an interval or as a union of intervals if there are gaps. For the sake of clarity, the lower limit and the upper limit are written here. In particular, the interval represents the set of real numbers. Otherwise, the empty set is an open interval . ${\ displaystyle - \ infty}$ ${\ displaystyle + \ infty}$ ${\ displaystyle (- \ infty, + \ infty)}$ ${\ displaystyle (a, a)}$

Compared to interval arithmetic, SoRNs are also benign if otherwise the erasure of positions or, due to undetected dependencies, each iteration step leads to an exponential loss of accuracy. If one does not start with a one-element interval , for example, every sequence of intervals according to traditional interval arithmetic that results from the repeated application of the operations or diverges (that is, the intervals become larger and larger and therefore useless as an estimate), whereas the intervals described by SoRNs steadily get smaller and nail down the result. ${\ displaystyle xx}$ ${\ displaystyle x / x}$

arithmetic

The arithmetic of the Unums can be continued on ZONs: If and are two ZONs, then is the ZON, which consists of the Unums with from and from . Multiplication is defined analogously; Subtraction and division are found to be according to and accordingly according to . The one-digit operations and are applied to the individual elements. ${\ displaystyle U}$ ${\ displaystyle V}$ ${\ displaystyle U + V}$ ${\ displaystyle u + v}$ ${\ displaystyle u}$ ${\ displaystyle U}$ ${\ displaystyle v}$ ${\ displaystyle V}$ ${\ displaystyle UV}$ ${\ displaystyle U + (- V)}$ ${\ displaystyle U / V}$ ${\ displaystyle U \ cdot V ^ {- 1}}$ ${\ displaystyle -V}$ ${\ displaystyle V ^ {- 1}}$

Operations are also possible with SoRNs, the result of which is not always a single number, even in a mathematical sense. For example, the square root function is undefined for negative inputs. As an operation on SoRNs, the empty SoRN accordingly delivers. In principle, it is possible to return the empty result for any undefined operation, but in certain respects other results are often better. For example, the answer to question is: What number solves the equation ? For and this answer is: any number. Hence the full ZON results. The same applies to and (which must be the same, there ). An example of an undefined operation that delivers neither the empty nor the full SoRN is with the solution set , i.e. a SoRN that includes , and all positive intervals. ${\ displaystyle {\ sqrt {-1}}}$ ${\ displaystyle a / b}$ ${\ displaystyle x}$ ${\ displaystyle a = b \ cdot x}$ ${\ displaystyle a = 0}$ ${\ displaystyle b = 0}$ ${\ displaystyle 0/0}$ ${\ displaystyle \ infty + \ infty}$ ${\ displaystyle \ infty - \ infty}$ ${\ displaystyle + \ infty = - \ infty}$ ${\ displaystyle 1 ^ {\ infty}}$ ${\ displaystyle \ lbrack 0, \ infty \ rbrack}$ ${\ displaystyle \ lbrace \ pm \ infty \ rbrace}$ ${\ displaystyle \ lbrace 0 \ rbrace}$

For example, given the ZON . The logarithm of this (regardless of the base) gives . The square of this result is . ${\ displaystyle (0, + \ infty)}$ ${\ displaystyle (- \ infty, + \ infty)}$ ${\ displaystyle \ lbrack 0, + \ infty)}$

Another example is the function with . What ? A calculation with Unums results . Now it is so that also is, whereby the definition gap at 0 disappears, and the result is confirmed. But what if the input was not exactly known, but is given as the ZON ? If the first representation of is used, the operations on 4-bit Unums deliver exact results and the result SoRN represents the statement . For the second term with traditional interval arithmetic, the input interval is first to (traditional interval arithmetic because only works with closed intervals) completed. The result is sobering . In the second representation occurs twice; that it must be the same value twice (within the limits) and not two independent values with the same limits, but is not taken into account in the mechanical evaluation. ${\ displaystyle f}$ ${\ displaystyle f (x) = {\ tfrac {1} {{\ tfrac {1} {x}} + {\ tfrac {1} {2}}}}}$ ${\ displaystyle f (0)}$ ${\ displaystyle {\ tfrac {1} {{\ tfrac {1} {0}} + {\ tfrac {1} {2}}}} = {\ tfrac {1} {\ infty + {\ tfrac {1} {2}}}} = {\ tfrac {1} {\ infty}} = 0}$ ${\ displaystyle f (x) = {\ tfrac {2x} {2 + x}}}$ ${\ displaystyle (-1.2 \ rbrack}$ ${\ displaystyle f}$ ${\ displaystyle -2 <f (x) \ leq 1}$ ${\ displaystyle (-1.2 \ rbrack}$ ${\ displaystyle \ lbrack -1,2 \ rbrack}$ ${\ displaystyle \ lbrack -2,4 \ rbrack}$ ${\ displaystyle x}$

Run length coding

A SoRN is considered to be connected if all ones are packed next to each other in its representation (includes the empty SoRN). Because the number line is at both ends, but is only coded once in the SoRN, the SoRN is not connected in the mathematical sense (more precisely: in the sense of the projective straight line the set is already connected, but not in the sense of a number line with end points) but the ZON that it represents. The basic arithmetic operations and many others like the root get connected SoRNs, that is, for connected SoRNs as inputs, they deliver connected SoRNs as a result. A so-called run-length encoding is therefore suitable . Instead of any SoRNs, only related ones are shown. The run length coding stores the beginning and the length of the ones column. For 256 digits, 8 bits are sufficient for the beginning and the length, so the run length coding requires a total of 16 bits. Only the full ZON is contiguous, but strictly speaking, it cannot be represented in this way, since its length (256 ones) just exceeds the maximum (255) in the length representation. This can be remedied, since formally several representations exist for the empty ZON, namely all with length 0. One of them is reinterpreted for the full ZON. ${\ displaystyle \ infty}$ ${\ displaystyle [- \ infty, 0] \ cup [1, + \ infty]}$

In particular, run-length coding for SoRNs is more effective the higher the number of bits. For 16 ‑ bit Unums, a general SoRN would be 65,536 bits wide, i.e. almost 8 KiB in size. The run length coding only needs 32 bits.

example

A 4-bit Type II Unum is used for the example, with 2 being the only number that can be selected. There are 16 intervals , , , , , , , , , , , , , , and . A subset word thus consists of bits. ${\ displaystyle \ lbrace \ pm \ infty \ rbrace}$ ${\ displaystyle (- \ infty, -2)}$ ${\ displaystyle \ lbrace -2 \ rbrace}$ ${\ displaystyle (-2, -1)}$ ${\ displaystyle \ lbrace -1 \ rbrace}$ ${\ displaystyle (-1, - {\ tfrac {1} {2}})}$ ${\ displaystyle \ lbrace - {\ tfrac {1} {2}} \ rbrace}$ ${\ displaystyle (- {\ tfrac {1} {2}}, 0)}$ ${\ displaystyle \ lbrace 0 \ rbrace}$ ${\ displaystyle (0, + {\ tfrac {1} {2}})}$ ${\ displaystyle \ lbrace + {\ tfrac {1} {2}} \ rbrace}$ ${\ displaystyle (+ {\ tfrac {1} {2}}, + 1)}$ ${\ displaystyle \ lbrace +1 \ rbrace}$ ${\ displaystyle (+ 1, + 2)}$ ${\ displaystyle \ lbrace +2 \ rbrace}$ ${\ displaystyle (+2, + \ infty)}$ ${\ displaystyle 2 ^ {4} = 16}$

In general, the run-length coding of bits requires exactly bits. ${\ displaystyle 2 ^ {n}}$ ${\ displaystyle 2n}$

Type III

The Type III Unum system is also known as Posits and Valids . The English word posit means postulate . The posits are the actual numbers that are used to calculate; the Valids define intervals with limits on positions of the same format. The primary purpose of Valids is error estimation, which is often within usable limits due to the high error tolerance of the posit limits. The big advantage of interval arithmetic is (typically) that the true result is guaranteed to be within the interval limits. Interval arithmetic therefore always runs the risk that the lower and upper interval limits are so far apart that the statement is correct, but the result is practically no longer meaningful.

IEEE floating point numbers allow errors in the evaluation of the transcendent functions prescribed by the standard , which go beyond the rounding error, mainly so that the value tables in the hardware do not become too large. Posits do not make such concessions. On the contrary: implementations of type III Unums must offer a scalar product operation (also called fused dot product ), the result of which is calculated precisely to the last bit. The operations Fused multiply-add and Fused sum are special cases of the scalar product. ( Precise means within the rounding accuracy.) The scalar product is also not restricted in length; Implementations have to expect any number of summands. A buffer called Quire is used for this. Suitable Quire sizes are listed in a table below .

The posits consist of a sign bit (engl. Sign ), a variable number of regimes bits that determine gross order of magnitude (and for which no IEEE-754 equivalent exists), the exponent bits and the mantissa bits.

The number of regime, exponent and mantissa bits varies. The determinants of a Unum format are its total bits (in practice usually a power of two) and its maximum exponent bits . One- bit unum allows after subtracting the sign bit for a concrete positive between and regime bits. Due to the construction, the number of regime bits must be at least (for construction, see below ). If the regime takes up more bits than remain according to total bits minus , the number of exponent bits is subordinate, correspondingly reduced and smaller than . Any remaining bits are allotted to the mantissa. ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle n}$ ${\ displaystyle 2}$ ${\ displaystyle n-1}$ ${\ displaystyle 2}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle E _ {\ text {max}}}$

Sample format

Given the format with bits and . If such a posit has regime bits, the result is a 1: 4: 2: 9 distribution of sign, regime, exponent and mantissa. If there are 12 regime bits, the distribution is 1: 12: 2: 1 and if there are 14 regime bits, 1: 14: 1: 0. ${\ displaystyle 16}$ ${\ displaystyle E _ {\ text {max}} = 2}$ ${\ displaystyle 4}$

Key figures

The number of total bits and the maximum exponent bits result in the key figure des ( u for Unum and English seed for seeds or germ ) as . For a Unum format with up to exponent bits . With up to 3 exponent bits this is and with already ${\ displaystyle useed}$ ${\ displaystyle useed = 2 ^ {E _ {\ text {max}}}}$ ${\ displaystyle 2}$ ${\ displaystyle useed = 2 ^ {2 ^ {2}} = 2 ^ {4} = 16}$ ${\ displaystyle useed = 265}$ ${\ displaystyle 4}$ ${\ displaystyle useed = 65536}$

In contrast to IEEE 754, Unums have no exponent bias or a comparable size.

Set of numbers

Since the negative numbers and the bit patterns of their representations emerge from the positive ones, only these need to be described. The set of Unums is generated iteratively. Starting from the simplest Unum format consisting of -bit-Unums, each consisting of only a sign (1 bit) and regime (min. Bit, see below for a reason), higher bits are generated by adding bits. With is the largest positive and with the smallest positive number in the current format. While it is true , the equality only applies if they are powers of two. ${\ displaystyle 3}$ ${\ displaystyle 2}$ ${\ displaystyle max}$ ${\ displaystyle min}$ ${\ displaystyle min \ approx {\ tfrac {1} {max}}}$

The set of -Bit-Unums consists of elements: and , and are fixed; in addition there is dependent , together with , and . Even if the -bit format inevitably does not allow exponent bits because the regime bits "always eat everything up", they can be set as desired; this also makes sense, since the iterative representation of the number set remains the same, only the number of bits is increased. The 3 ‑ bit unum is only the starting point of the construction. In is bit format . ${\ displaystyle 3}$ ${\ displaystyle 8}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$ ${\ displaystyle -1}$ ${\ displaystyle +1}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle useed}$ ${\ displaystyle -useed}$ ${\ displaystyle useed ^ {- 1}}$ ${\ displaystyle -useed ^ {- 1}}$ ${\ displaystyle 3}$ ${\ displaystyle E _ {\ text {max}} \ leq 0}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle 3}$ ${\ displaystyle max = useed}$

For the step from ‑Bit-Unum to ‑Bit-Unum, a bit is added to each representation. If this bit is 0 , the represented value remains the same, regardless of whether the bit is assigned to the regime, the exponent or the mantissa. If the bit is 1 , the bit pattern represents a new number that lies between the two numbers whose representations, interpreted as a whole number, are larger and smaller: ${\ displaystyle n}$ ${\ displaystyle n + 1}$ ${\ displaystyle 1}$

Between and is inserted and correspondingly between and . The added bit becomes part of the regime.

{\ displaystyle max}

{\ displaystyle \ infty}

{\ displaystyle max \ cdot useed}

{\ displaystyle max / useed}

{\ displaystyle min}

{\ displaystyle 0}

Between and , wherein and as more apart, which is geometric center of and inserted: . The added bit becomes part of the exponent. ${\ displaystyle 2 ^ {m}}$ ${\ displaystyle 2 ^ {n}}$ ${\ displaystyle m}$ ${\ displaystyle n}$ ${\ displaystyle 1}$ ${\ displaystyle 2 ^ {m}}$ ${\ displaystyle 2 ^ {n}}$ ${\ displaystyle 2 ^ {(m + n) / 2}}$

Otherwise, their arithmetic mean is added between and . The added bit becomes part of the mantissa. ${\ displaystyle x}$ ${\ displaystyle y}$ ${\ displaystyle {\ tfrac {x + y} {2}}}$

The first point adds a lot of exponentially spaced numbers on the extreme edges. In a sense, they approximate and therefore well. The last point inserts a lot of linearly spaced numbers around . Exponentially or linearly apart means: If they are consecutive, then is in the first case and in the second case . Due to the design, the sum in the second point is an even whole number, half of which is therefore an integer. ${\ displaystyle 0}$ ${\ displaystyle \ infty}$ ${\ displaystyle 1}$ ${\ displaystyle x <y <z}$ ${\ displaystyle z = y \ cdot {\ tfrac {y} {x}}}$ ${\ displaystyle z = y + (yx)}$ ${\ displaystyle m + n}$

The first construction steps for the positive positions with , i.e. with , are: ${\ displaystyle E _ {\ text {max}} = 1}$ ${\ displaystyle useed = 2 ^ {4}}$

Output format with bits ${\ displaystyle 3}$

{\ displaystyle \ lbrace 0,1 / useed, 1, useed, \ infty \ rbrace}

{\ displaystyle = \ lbrace 0.2 ^ {- 4}, 2 ^ {0}, 2 ^ {4}, \ infty \ rbrace}

; and .

{\ displaystyle min = 2 ^ {- 4}}

{\ displaystyle max = 2 ^ {4}}

Format with bits: ${\ displaystyle 4}$

{\ displaystyle \ lbrace 0, min / useed, 2 ^ {- 4}, 2 ^ {(- 4 + 0) / 2}, 2 ^ {0}, 2 ^ {(0 + 4) / 2}, 2 ^ {4}, max \ cdot useed, \ infty \ rbrace}

{\ displaystyle = \ lbrace 0.2 ^ {- 8}, 2 ^ {- 4}, 2 ^ {- 2}, 2 ^ {0}, 2 ^ {2}, 2 ^ {4}, 2 ^ { 8}, \ infty \ rbrace}

; and .

{\ displaystyle min = 2 ^ {- 8}}

{\ displaystyle max = 2 ^ {8}}

Format with bits: ${\ displaystyle 5}$

{\ displaystyle \ lbrace 0, min / useed, 2 ^ {- 8}, 2 ^ {(- 8-4) / 2}, 2 ^ {- 4}, 2 ^ {(- 4-2) / 2} , 2 ^ {- 2}, 2 ^ {(- 2 + 0) / 2}, 2 ^ {0}, 2 ^ {(0 + 2) / 2}, 2 ^ {2}, 2 ^ {(2 +4) / 2}, 2 ^ {4}, 2 ^ {(4 + 8) / 2}, 2 ^ {8}, max \ cdot useed, \ infty \ rbrace}

{\ displaystyle = \ lbrace 0.2 ^ {- 12}, 2 ^ {- 8}, 2 ^ {- 6}, 2 ^ {- 4}, 2 ^ {- 3}, 2 ^ {- 2}, 2 ^ {- 1}, 2 ^ {0}, 2 ^ {1}, 2 ^ {2}, 2 ^ {3}, 2 ^ {4}, 2 ^ {6}, 2 ^ {8}, 2 ^ {12}, \ infty \ rbrace}

; and .

{\ displaystyle min = 2 ^ {- 12}}

{\ displaystyle max = 2 ^ {12}}

Format with bits: ${\ displaystyle 6}$

{\ displaystyle \ lbrace 0, min / useed}

, , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

{\ displaystyle 2 ^ {- 12}}

{\ displaystyle 2 ^ {(- 12-8) / 2}}

{\ displaystyle 2 ^ {- 8}}

{\ displaystyle 2 ^ {(- 8-6) / 2}}

{\ displaystyle 2 ^ {- 6}}

{\ displaystyle 2 ^ {(- 6-4) / 2}}

{\ displaystyle 2 ^ {- 4}}

{\ displaystyle {\ tfrac {2 ^ {- 4} +2 ^ {- 3}} {2}}}

{\ displaystyle 2 ^ {- 3}}

{\ displaystyle {\ tfrac {2 ^ {- 3} +2 ^ {- 2}} {2}}}

{\ displaystyle 2 ^ {- 2}}

{\ displaystyle {\ tfrac {2 ^ {- 2} +2 ^ {- 1}} {2}}}

{\ displaystyle 2 ^ {- 1}}

{\ displaystyle {\ tfrac {2 ^ {- 1} + 2 ^ {0}} {2}}}

{\ displaystyle 2 ^ {0}}

{\ displaystyle {\ tfrac {2 ^ {0} + 2 ^ {1}} {2}}}

{\ displaystyle 2 ^ {1}}

{\ displaystyle {\ tfrac {2 ^ {1} + 2 ^ {2}} {2}}}

{\ displaystyle 2 ^ {2}}

{\ displaystyle {\ tfrac {2 ^ {2} + 2 ^ {3}} {2}},}

{\ displaystyle 2 ^ {3}}

{\ displaystyle {\ tfrac {2 ^ {3} + 2 ^ {4}} {2}}}

{\ displaystyle 2 ^ {4}}

{\ displaystyle 2 ^ {(4 + 6) / 2}}

{\ displaystyle 2 ^ {6}}

{\ displaystyle 2 ^ {(6 + 8) / 2}}

{\ displaystyle 2 ^ {8}}

{\ displaystyle 2 ^ {(8 + 12) / 2}}

{\ displaystyle 2 ^ {12}}

{\ displaystyle max \ cdot useed}

{\ displaystyle \ infty \ rbrace}

{\ displaystyle = \ lbrace 0}

, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ; and .

{\ displaystyle 2 ^ {- 16}}

{\ displaystyle 2 ^ {- 12}}

{\ displaystyle 2 ^ {- 10}}

{\ displaystyle 2 ^ {- 8}}

{\ displaystyle 2 ^ {- 7}}

{\ displaystyle 2 ^ {- 6}}

{\ displaystyle 2 ^ {- 5}}

{\ displaystyle 2 ^ {- 4}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {- 4}}

{\ displaystyle 2 ^ {- 3}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {- 3}}

{\ displaystyle 2 ^ {- 2}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {- 2}}

{\ displaystyle 2 ^ {- 1}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {- 1}}

{\ displaystyle 2 ^ {0}}

{\ displaystyle 1 {\ tfrac {1} {2}}}

{\ displaystyle 2 ^ {1}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {1}}

{\ displaystyle 2 ^ {2}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {2}}

{\ displaystyle 2 ^ {3}}

{\ displaystyle 1 {\ tfrac {1} {2}} \ cdot 2 ^ {3}}

{\ displaystyle 2 ^ {4}}

{\ displaystyle 2 ^ {5}}

{\ displaystyle 2 ^ {6}}

{\ displaystyle 2 ^ {7}}

{\ displaystyle 2 ^ {8}}

{\ displaystyle 2 ^ {10}}

{\ displaystyle 2 ^ {12}}

{\ displaystyle 2 ^ {16}}

{\ displaystyle \ infty \ rbrace}

{\ displaystyle min = 2 ^ {- 16}}

{\ displaystyle max = 2 ^ {16}}

Interpretation of a bit pattern

With the exception of this , the order on postits should match the order on their representations, the representations being interpreted as signed whole numbers in two's complement representation. From the set of numbers and this assumption, it is clear which number is described by any Postits representation. Here is a clear, effective procedure. ${\ displaystyle \ infty}$

If the bit pattern consists of 00 … 0 , it represents ; if it consists of 10 … 0 , then it represents . In the following, these two cases are excluded. ${\ displaystyle 0}$ ${\ displaystyle \ infty}$

The sign is the first (= most significant) bit . As with IEEE 754, the interpretation is that for is positive and for is negative. Exceptions are the values and ; actually technically has the sign bit 0 and the sign bit 1, yet both are referred to as neither positive nor negative. In contrast to IEEE floating point numbers, posits do not store a sign and an amount. ${\ displaystyle s}$ ${\ displaystyle s = 0}$ ${\ displaystyle s = 1}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$

The following considerations are given for . Is , then the number is made non-negative by two's complement before the key figures are determined. ${\ displaystyle s = 0}$ ${\ displaystyle s = 1}$

The run-length (engl. Run-length ) of a specific number is obtained from the regime bits. The regime follows the sign and consists of a sequence of the same binary digits, which is ended by the other binary digit or runs to the end. The regime always has the form 0 … 01 or 1 … 10 or, if the regime takes up the entire remainder of the number, 1 … 11 and the shortest possible forms are 01 , 10 and 11 . For a total length of 5, the possible regimes are between 2 and 4 bits long, specifically 0001 , 001 , 01 , 10 , 110 , 1110 and 1111 . If zeros lead the regime, then the number of these zeros is negative. For 0001 is etc. to 01 for . If, on the other hand, ones lead the regime, the number of these ones is therefore positive or zero. For 10 is for 110 is etc. to 1111 for . The regime cannot consist of only zeros, because then the bit pattern (depending on the sign bit) would be that for or (see above). ${\ displaystyle k}$ ${\ displaystyle -k}$ ${\ displaystyle k}$ ${\ displaystyle k = -3}$ ${\ displaystyle k = -1}$ ${\ displaystyle k + 1}$ ${\ displaystyle k}$ ${\ displaystyle k = 0,}$ ${\ displaystyle k = 1}$ ${\ displaystyle k = 3}$ ${\ displaystyle 0}$ ${\ displaystyle \ infty}$

It denotes the numerical value of the exponent bits, possibly supplemented by zeros, interpreted as an unsigned binary number. If the number of exponent bits is smaller than , fictitious zeros are added to the bit pattern so that the unsigned number consists of exactly binary digits. ${\ displaystyle e}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle e}$ ${\ displaystyle E _ {\ text {max}}}$

If there are bits left in the representation that were not assigned, they are omitted from the mantissa. If there are mantissa bits and is , then the numerical value of the mantissa bits is interpreted as an unsigned binary number; otherwise set . ${\ displaystyle F}$ ${\ displaystyle F> 0}$ ${\ displaystyle f}$ ${\ displaystyle f = 0}$

For determining the numerical value of a Unum from its bit pattern it is

{\ displaystyle (-1) ^ {s} \ cdot useed ^ {k} \ cdot 2 ^ {e} \ cdot (1 + f / 2 ^ {F})}

.

The mantissa clearly contributes the factor (in binary representation), with the binary digits in the mantissa denoting. In the terms of IEEE 754, the hidden bit is always 1 and Unums generally have no subnormal numbers either. The value of is chosen so that applies. This also makes the representation clear. ${\ displaystyle 1, f_ {1} \ ldots f_ {F}}$ ${\ displaystyle f_ {i}}$ ${\ displaystyle useed}$ ${\ displaystyle 2 ^ {| e |} <useed}$

The run length very quickly generates very large or small amounts at the expense of the accuracy in the mantissa, but short regimes, that is run lengths close and thus numbers close , leave a lot of space for mantissa bits. As a result, compared to IEEE ‑ 754 floating point numbers, the same total number of bits can be used to represent larger numbers in terms of amount and to resolve numbers more precisely. ${\ displaystyle 0}$ ${\ displaystyle 1}$ ${\ displaystyle 1}$

Consequences of the representation

A posit is negated by negating its representation as an integer according to two's complement. Since the two's complement operation leaves both the representation of and the representation of the same, no special case treatment is required for their negation. ${\ displaystyle 0}$ ${\ displaystyle \ infty}$

Thanks to the two's complement representation, a greater / smaller comparison is possible without determining the key figures (provided the format, i.e. total width and , is the same). With the exception of , a positon is smaller or larger than another if their representations, interpreted as signed integers, have the same relation. The numbers do not have to be laboriously decoded. ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle \ infty}$

Sample bit pattern

Use a 16 ‑ bit unum with exponent length as the type . The number has the bit pattern 0 : 0001 : 101 : 11011101 , the colon separating the sign, regime, exponent and mantissa bits. ${\ displaystyle E _ {\ text {max}} = 3}$

Special cases 0 and : Not available. ${\ displaystyle \ infty}$
Determine . ${\ displaystyle useed = 2 ^ {2 ^ {3}} = 256}$
Sign 0 : . ${\ displaystyle s = 0}$
Walking distance 0001 : It lead three zeros, thus , d. h . ${\ displaystyle k}$ ${\ displaystyle -k = 3}$ ${\ displaystyle k = -3}$
Exponent 101 : . ${\ displaystyle e = 101_ {2} = 5_ {10}}$
Mantissa 11011101 : with . ${\ displaystyle f = 11011101_ {2} = 221_ {10}>}$ ${\ displaystyle F = 8}$

This results in

{\ displaystyle (-1) ^ {s} \ cdot useed ^ {k} \ cdot 2 ^ {e} \ cdot (1 + f / 2 ^ {F}) =}

{\ displaystyle (-1) ^ {0} \ cdot 256 ^ {- 3} \ cdot 2 ^ {5} \ cdot (1 + 221/2 ^ {8}) =}

{\ displaystyle 1 \ cdot 2 ^ {- 24} \ cdot 2 ^ {5} \ cdot (1 + 221/2 ^ {8})}

The last expression gives . ${\ displaystyle 3 {,} 553926944732666015625 \ cdot 10 ^ {- 6}}$

Table with example values

For another example, consider the 8-bit format and as well . Then is or . Some numbers and their values are shown in the table below. ${\ displaystyle E _ {\ text {max}} = 3}$ ${\ displaystyle E _ {\ text {max}} = 1}$ ${\ displaystyle useed = 256}$ ${\ displaystyle E _ {\ text {max}} = 4}$

If the number of exponent bits is less than , the implicit exponent bits are underlined. ${\ displaystyle E _ {\ text {max}}}$

Values of , , , and for ${\ displaystyle s}$ ${\ displaystyle k}$ ${\ displaystyle e}$ ${\ displaystyle F}$ ${\ displaystyle f}$ ${\ displaystyle E _ {\ text {max}} = 3}$	Interpretation for ${\ displaystyle E _ {\ text {max}} = 3}$	Bit pattern		Values of , , , and for ${\ displaystyle s}$ ${\ displaystyle k}$ ${\ displaystyle e}$ ${\ displaystyle F}$ ${\ displaystyle f}$ ${\ displaystyle E _ {\ text {max}} = 1}$	Interpretation for ${\ displaystyle E _ {\ text {max}} = 1}$
not defined	${\ displaystyle \ infty}$	10000000		not defined	${\ displaystyle \ infty}$
${\ displaystyle k = 6}$	${\ displaystyle useed ^ {6} = 2 ^ {48} \ approx 2 {,} 81 \ cdot 10 ^ {14}}$	0 1111111		${\ displaystyle k = 6, \, e = {\ underline {0}} _ {2} = 0}$	${\ displaystyle useed ^ {6} = 2 ^ {12} = 4096}$
${\ displaystyle k = 5}$	${\ displaystyle useed ^ {5} = 2 ^ {40} \ approx 1 {,} 10 \ cdot 10 ^ {12}}$	0 1111110		${\ displaystyle k = 5, \, e = {\ underline {0}} _ {2} = 0}$	${\ displaystyle useed ^ {5} = 2 ^ {10} = 1024}$
${\ displaystyle k = 4, \, e = 1 {\ underline {00}} _ {2} = 4}$	${\ displaystyle useed ^ {4} \ times 2 ^ {4} = 2 ^ {8 \ times 4 + 4} = 2 ^ {36} \ approx 6 {,} 87 \ times 10 ^ {10}}$	0 111110 1		${\ displaystyle k = 4, \, e = 1}$	${\ displaystyle useed ^ {4} \ cdot 2 ^ {1} = 2 ^ {2 \ cdot 4 + 1} = 512}$
...
${\ displaystyle k = 3, \, e = 01 {\ underline {0}} _ {2} = 2}$	${\ displaystyle useed ^ {3} \ times 2 ^ {2} = 2 ^ {8 \ times 3 + 2} = 2 ^ {26} \ approx 6 {,} 71 \ times 10 ^ {7}}$	0 11110 01	0 11 110 0 1	${\ displaystyle k = 3, \, e = 0, \, F = 1, \, f = 1}$	${\ displaystyle useed ^ {3} \ cdot 2 ^ {0} \ cdot (1 + 1/2 ^ {1}) = 96}$
...
${\ displaystyle k = 2, \, e = 101_ {2} = 5}$	${\ displaystyle useed ^ {2} \ cdot 2 ^ {5} = 2 ^ {8 \ cdot 2 + 5} = 2 ^ {21} = 2 \, 097 \, 152}$	0 1110 101	0 1110 1 01	${\ displaystyle k = 2, \, e = 1, \, F = 2, \, f = 1}$	${\ displaystyle useed ^ {2} \ cdot 2 ^ {1} \ cdot (1 + 1/2 ^ {2}) = 40}$
...
${\ displaystyle k = 1, \, e = 6, \, F = 1, \, f = 1}$	${\ displaystyle useed ^ {1} \ cdot 2 ^ {6} \ cdot 1 {\ tfrac {1} {2}} = 2 ^ {8 \ cdot 1 + 6} \ cdot 1 {\ tfrac {1} {2 }} = 2 ^ {14} \ cdot 1 {\ tfrac {1} {2}} = 2 ^ {14} \ cdot 2 ^ {13} = 24 \, 576}$	0 110 110 1	0 110 1 101	${\ displaystyle k = 1, \, e = 1, \, F = 3, \, f = 101_ {2} = 5}$	${\ displaystyle useed ^ {1} \ cdot 2 ^ {1} \ cdot (1 + 5/2 ^ {3}) = 8 \ cdot 1 {\ tfrac {5} {8}} = 13}$
...
${\ displaystyle k = 1, \, e = 5, \, F = 1, \, f = 0}$	${\ displaystyle useed ^ {1} \ cdot 2 ^ {5} = 2 ^ {8 \ cdot 1 + 5} = 8192}$	0 110 101 0	0 110 1 010	${\ displaystyle k = 1, \, e = 1, \, F = 3, \, f = 010_ {2} = 2}$	${\ displaystyle useed ^ {1} \ cdot 2 ^ {1} \ cdot (1 + 2/2 ^ {3}) = 8 \ cdot 1 {\ tfrac {1} {4}} = 10}$
...
${\ displaystyle k = 0, \, e = 3, \, F = 2, \, f = 2}$	${\ displaystyle useed ^ {0} \ cdot 2 ^ {3} \ cdot 1 {\ tfrac {1} {2}} = 12}$	0 10 011 10	0 10 0 1110	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 14}$	${\ displaystyle 1 + 14/2 ^ {4} = 1 {,} 875}$
${\ displaystyle k = 0, \, e = 3, \, F = 2, \, f = 1}$	${\ displaystyle 8 \ cdot 1 {\ tfrac {1} {4}} = 10}$	0 10 011 01	0 10 0 1101	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 13}$	${\ displaystyle 1 + 13/2 ^ {4} = 1 {,} 8125}$
...
${\ displaystyle k = 0, \, e = 0, \, F = 2, \, f = 3}$	${\ displaystyle 2 ^ {0} \ cdot 1 {\ tfrac {3} {4}} = 1 {\ tfrac {3} {4}} = 1 {,} 75}$	0 10 000 11	0 10 0 0011	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 3}$	${\ displaystyle 1 + 3/16 = 1 {,} 1875}$
${\ displaystyle k = 0, \, e = 0, \, F = 2, \, f = 2}$	${\ displaystyle 1 {\ tfrac {1} {2}} = 1 {,} 5}$	0 10 000 10	0 10 0 0010	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 2}$	${\ displaystyle 1 + 2/16 = 1 {,} 125}$
${\ displaystyle k = 0, \, e = 0, \, F = 2, \, f = 1}$	${\ displaystyle 1 {\ tfrac {1} {4}} = 1 {,} 25}$	0 10 000 01	0 10 0 0001	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 1}$	${\ displaystyle 1 + 1/16 = 1 {,} 0625}$
${\ displaystyle k = 0, \, e = 0, \, F = 2, \, f = 0}$	${\ displaystyle 1}$	0 10 000 00	0 10 0 0000	${\ displaystyle k = 0, \, e = 0, \, F = 4, \, f = 0}$	${\ displaystyle 1}$
${\ displaystyle k = -1, \, e = 7, \, F = 2, \, f = 3}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {7} \ cdot 1 {\ tfrac {3} {4}} = 2 ^ {8-7} \ cdot 1 {\ tfrac {3} {4}} = (1 {\ tfrac {3} {4}}) / 2 = 0 {,} 875}$	0 01 111 11	0 01 1 1111	${\ displaystyle k = -1, \, e = 1, \, F = 4, \, f = 15}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {1} \ cdot (1 + 15/16) = (1 + 15/16) / 2 = 0 {,} 96875}$
${\ displaystyle k = -1, \, e = 7, \, F = 2, \, f = 2}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {7} \ cdot 1 {\ tfrac {1} {2}} = 2 ^ {8-7} \ cdot 1 {\ tfrac {1} {2}} = (1 {\ tfrac {1} {2}}) / 2 = 0 {,} 75}$	0 01 111 10	0 01 1 1110	${\ displaystyle k = -1, \, e = 1, \, F = 4, \, f = 14}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {1} \ cdot (1 + 14/16) = (1 + 14/16) / 2 = 0 {,} 9375}$
${\ displaystyle k = -1, \, e = 7, \, F = 2, \, f = 2}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {7} \ cdot 1 {\ tfrac {1} {4}} = 2 ^ {8-7} \ cdot 1 {\ tfrac {1} {4}} = (1 {\ tfrac {1} {4}}) / 2 = 0 {,} 625}$	0 01 111 01	0 01 1 1101	${\ displaystyle k = -1, \, e = 1, \, F = 4, \, f = 13}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {1} \ cdot (1 + 13/16) = (1 + 13/16) / 2 = 0 {,} 90625}$
...
${\ displaystyle k = -1, \, e = 4, \, F = 2, \, f = 2}$	${\ displaystyle useed ^ {- 1} \ times 2 ^ {4} \ times 1 {\ tfrac {1} {2}} = 1 {\ tfrac {1} {2}} / 16 = 0 {,} 09375}$	0 01 100 10	0 01 1 0010	${\ displaystyle k = -1, \, e = 1, \, F = 4, \, f = 2}$	${\ displaystyle useed ^ {- 1} \ times 2 \ times 1 {\ tfrac {1} {8}} = 1 {\ tfrac {1} {8}} / 2 = 0 {,} 564}$
...
${\ displaystyle k = -1, \, e = 1, \, F = 2, \, f = 1}$	${\ displaystyle useed ^ {- 1} \ cdot 2 ^ {1} \ cdot 1 {\ tfrac {1} {4}} = 2 ^ {- 7} \ cdot 1 {\ tfrac {1} {4}} = 0 {,} 009765625}$	0 01 001 01	0 01 0 0101	${\ displaystyle k = -1, \, e = 0, \, F = 4, \, f = 5}$	${\ displaystyle useed ^ {- 1} \ cdot (1 + 5/16) = (1 + 5/16) / 4 = 0 {,} 328125}$
...
${\ displaystyle k = -2, \, e = 7, \, F = 1, \, f = 1}$	${\ displaystyle useed ^ {- 2} \ cdot 2 ^ {7} \ cdot 1 {\ tfrac {1} {2}} = 2 ^ {- 9} \ cdot 1 {\ tfrac {1} {2}} \ approx 0 {,} 0029297}$	0 001 111 1	0 001 1 111	${\ displaystyle k = -2, \, e = 1, \, F = 3, \, f = 7}$	${\ displaystyle useed ^ {- 2} \ cdot 2 \ cdot 1 {\ tfrac {7} {8}} = 2 ^ {- 3} \ cdot 1 {\ tfrac {7} {8}} = 0 {,} 234375}$
...
${\ displaystyle k = -2, \, e = 2, \, F = 1, \, f = 1}$	${\ displaystyle useed ^ {- 2} \ cdot 2 ^ {5} \ cdot 1 {\ tfrac {1} {2}} = 0 {,} 000732421875}$	0 001 010 1	0 001 0 101	${\ displaystyle k = -2, \, e = 0, \, F = 3, \, f = 5}$	${\ displaystyle useed ^ {- 2} \ cdot 1 {\ tfrac {5} {8}} = 1 {\ tfrac {5} {8}} / 16 = 0 {,} 1015625}$
...
${\ displaystyle k = -3, \, e = 6}$	${\ displaystyle useed ^ {- 3} \ times 2 ^ {6} = 2 ^ {- 8 \ times 3 + 6} = 2 ^ {- 18} \ approx 3 {,} 8147 \ times 10 ^ {- 6} }$	0 0001 110	0 0001 1 10	${\ displaystyle k = -3, \, e = 1, \, F = 2, \, f = 2}$	${\ displaystyle useed ^ {- 3} \ cdot 2 \ times 1 {\ tfrac {1} {2}} = 2 ^ {- 5} \ cdot 1 {\ tfrac {1} {2}} = 0 {,} 046875}$
...
${\ displaystyle k = -4, \, e = 01 {\ underline {0}} _ {2} = 4}$	${\ displaystyle useed ^ {- 4} \ times 2 ^ {4} = 2 ^ {- 8 \ times 4 + 4} = 2 ^ {- 28} \ approx 3 {,} 73 \ times 10 ^ {- 9} }$	0 00001 01	0 00001 0 1	${\ displaystyle k = -4, \, e = 0, \, F = 1, \, f = 1}$	${\ displaystyle useed ^ {- 4} \ cdot 1 {\ tfrac {1} {2}} = 1 {\ tfrac {1} {2}} / 2 ^ {- 8} = 0 {,} 005859375}$
...
${\ displaystyle k = -5, \, e = 1 {\ underline {00}} _ {2} = 4}$	${\ displaystyle useed ^ {- 5} \ cdot 2 ^ {4} = 2 ^ {- 36} \ approx 1 {,} 46 \ cdot 10 ^ {- 11}}$	0 000001 1		${\ displaystyle k = -5, \, e = 1}$	${\ displaystyle useed ^ {- 5} \ cdot 2 ^ {1} = 2 ^ {- 11} \ approx 4 {,} 88 \ cdot 10 ^ {- 4}}$
${\ displaystyle k = -5, \, e = 000_ {2} = 0}$	${\ displaystyle useed ^ {- 5} = 2 ^ {- 40} \ approx 9 {,} 09 \ cdot 10 ^ {- 13}}$	0 000001 0		${\ displaystyle k = -5, \, e = 0}$	${\ displaystyle useed ^ {- 5} = 2 ^ {- 10} \ approx 9 {,} 77 \ cdot 10 ^ {- 4}}$
${\ displaystyle k = -6}$	${\ displaystyle useed ^ {- 6} = 2 ^ {- 48} \ approx 3 {,} 55 \ cdot 10 ^ {- 15}}$	0 0000001		${\ displaystyle k = -6}$	${\ displaystyle useed ^ {- 6} = 2 ^ {- 12} \ approx 2 {,} 44 \ cdot 10 ^ {- 4}}$
not defined	0	00000000		not defined	0

In the five lines around the one it is clear that the reciprocal values cannot be clearly represented, but are mapped to the corresponding values. Exact reciprocal values exist exactly for the numbers with , i.e. powers of two. Because the regime for numbers with a very large or very small amount takes up many bits and thus leaves increasingly few for the mantissa, there are particularly many powers of two at the edges. Large can amplify this effect, but at the cost of numbers near (and accordingly ). In practice it is rather small, so for 8 ‑ bit Unums is a typical choice. ${\ displaystyle f = 0}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle 1}$ ${\ displaystyle -1}$ ${\ displaystyle E _ {\ text {max}}}$ ${\ displaystyle E _ {\ text {max}} = 1}$

Representing a number as a pos

For and , the presentation is clear. For one determines the bit pattern of and applies the two's complement to it. ${\ displaystyle x = 0}$ ${\ displaystyle x = \ infty}$ ${\ displaystyle x <0}$ ${\ displaystyle | x |}$

So be portrayed. First, the run length determine maximum or (in other words) is determined (where the rounds to the nearest integer towards means). This clearly defines the regime bits. If there are bits left for the exponent, determine maximal, ie . If there are any bits left for the mantissa, determine ${\ displaystyle x> 0}$ ${\ displaystyle k \ leq \ log _ {useed} (x)}$ ${\ displaystyle k = \ lfloor \ log _ {useed} (x) \ rfloor}$ ${\ displaystyle \ lfloor \; \ rfloor}$ ${\ displaystyle - \ infty}$ ${\ displaystyle e \ leq \ log _ {2} (x / useed ^ {k})}$ ${\ displaystyle e = \ lfloor \ log _ {2} (x / useed ^ {k}) \ rfloor}$ ${\ displaystyle F}$

{\ displaystyle f _ {\ text {opt}} = (x \ cdot useed ^ {- k} \ cdot 2 ^ {- e} -1) \ cdot 2 ^ {F}}

.

Then applies

{\ displaystyle x = useed ^ {k} \ cdot 2 ^ {e} \ cdot (1 + f _ {\ text {opt}} / 2 ^ {F})}

,

however, it is not necessary to be an integer. Therefore, it is determined as the rounded value of (with the usual rounding to the nearest whole number), where it is rounded to the nearest even number if it is exactly between two whole numbers. Finally, and are represented in binary form, whereby in general there is no substitution but rather an addition. An overflow can occur when rounding off , because if mantissa bits are provided, it can happen that the value needs exactly bits for representation. Formulated as simply as possible, the representation for the positive is added to the fictitious with the representation of (the representations are interpreted as an integer). If it fits in bits, it is the same as if the last bits were simply set. In exceptional cases, this leads to the correct adjustment of the exponent and possibly the regime. An exception handling is not necessary in order not to round up a finite value , since the regime bits at the upper edge of the number range take up the complete representation and are neither calculated nor calculated. ${\ displaystyle f _ {\ text {opt}}}$ ${\ displaystyle f}$ ${\ displaystyle f _ {\ text {opt}}}$ ${\ displaystyle f _ {\ text {opt}}}$ ${\ displaystyle e}$ ${\ displaystyle f}$ ${\ displaystyle f}$ ${\ displaystyle f _ {\ text {opt}}}$ ${\ displaystyle F}$ ${\ displaystyle f}$ ${\ displaystyle (F + 1)}$ ${\ displaystyle f = 0}$ ${\ displaystyle f}$ ${\ displaystyle f}$ ${\ displaystyle F}$ ${\ displaystyle \ infty}$ ${\ displaystyle e}$ ${\ displaystyle f}$

Example for the 8 ‑ bit format with . Then and , so is the run length ; the regime bits are thus 001 . After subtracting the sign bit, there are still an exponent bit and three mantissa bits. For the exponent arises , so rounded . For the mantissa results ${\ displaystyle x = 0 {,} 1}$ ${\ displaystyle E _ {\ text {max}} = 1}$ ${\ displaystyle useed = 4}$ ${\ displaystyle \ log _ {4} (0 {,} 1) \ approx -1 {,} 166}$ ${\ displaystyle k = -2}$ ${\ displaystyle e \ leq \ log _ {2} (1 {,} 16) \ approx 0 {,} 168}$ ${\ displaystyle e = 0}$

{\ displaystyle f _ {\ text {opt}} = (1 {,} 16-1) \ cdot 2 ^ {3} = 0 {,} 16 \ cdot 8 = 4 {,} 18}

and so results after rounding . For the representation you determine and and the result is 0: 001: 0: 101 . Counting results ${\ displaystyle f = 5}$ ${\ displaystyle e = 0 = 0_ {2}}$ ${\ displaystyle f = 5 = 101_ {2}}$

{\ displaystyle useed ^ {k} \ cdot 2 ^ {e} \ cdot (1 + f / 2 ^ {F}) = 4 ^ {- 2} \ cdot 2 ^ {0} \ cdot (1 + 5/8 ) = 1 {\ tfrac {5} {8}} / 16 = 0 {,} 1015625.}

The value can also be found in the table above.

Example for the 8 ‑ bit format with . Then are and . It arises in particular ${\ displaystyle x = 0 {,} 9999}$ ${\ displaystyle E _ {\ text {max}} = 1}$ ${\ displaystyle k = -1}$ ${\ displaystyle e = 1}$

{\ displaystyle f _ {\ text {opt}} = (0 {,} 9999 \ cdot 4 ^ {- (- 1)} \ cdot 2 ^ {- 1} -1) \ cdot 2 ^ {4} = 15 { ,} 99968 \ approx 16}

.

Well does n't fit in bits. The display without would be 0: 01: 1: 0000 which is also added. That gives 0: 10: 0: 0000 , which is the number . The largest number less than is represented by 0: 01: 1: 1111 and is ${\ displaystyle 16 = 10000_ {2}}$ ${\ displaystyle F = 4}$ ${\ displaystyle f}$ ${\ displaystyle 10000_ {2}}$ ${\ displaystyle 1}$ ${\ displaystyle 1}$ ${\ displaystyle 0 {,} 96875.}$

The values can be found in the table above.

Example for the 8 ‑ bit format with . Then and , so is the run length ; the regime bits are thus 01 . There are still 3 exponent and 2 mantissa bits left. The result for the exponent is therefore . For the mantissa results ${\ displaystyle x = 0 {,} 1}$ ${\ displaystyle E _ {\ text {max}} = 3}$ ${\ displaystyle useed = 256}$ ${\ displaystyle \ log _ {256} (0 {,} 1) \ approx -0 {,} 415}$ ${\ displaystyle k = -1}$ ${\ displaystyle e \ leq \ log _ {2} (x \ cdot 256) = \ log _ {2} (25 {,} 6) \ approx 4 {,} 678}$ ${\ displaystyle e = 4 = 100_ {2}}$

{\ displaystyle f _ {\ text {opt}} = (0 {,} 1 \ cdot 256 ^ {1} \ cdot 2 ^ {- 4} -1) \ cdot 2 ^ {2} = (1 {,} 6 -1) \ cdot 2 ^ {2} = 2 {,} 4}

.

From this it follows . The representation is therefore 0: 01: 100: 10 . Counting results ${\ displaystyle f = 2 = 10_ {2}}$

{\ displaystyle useed ^ {k} \ cdot 2 ^ {e} \ cdot (1 + f / 2 ^ {F}) = 256 ^ {- 1} \ cdot 2 ^ {4} \ cdot (1 + 2/4 ) = {\ tfrac {1 {\ tfrac {1} {2}}} {16}} = 0 {,} 09375}

.

This value can also be found in the table above.

The relative error is the first example and the last . The examples show that the larger range of values is paid for with a loss of accuracy. ${\ displaystyle 0 {,} 1015625/0 {,} 1-1 = 1 {,} 5625 \%}$ ${\ displaystyle 0 {,} 1/0 {,} 09375-1 = 6 {\ tfrac {2} {3}} \%}$

Taking advantage of the posit representation

Comparison of the fast sigmoid function and the fast_sigmoidimplementation for 8-bit positions

In an 8 ‑ bit unum with , the sigmoid function can be calculated approximately very quickly by using the representation of the unum. Instead of laboriously evaluating the exponential function, the sign bit is inverted and the representation is logically shifted 2 places to the right. In C this corresponds to the expression . The idea behind the approach is similar to that used to calculate the reciprocal of the square root . ${\ displaystyle E _ {\ text {max}} = 0}$ ${\ displaystyle \ textstyle x \ mapsto {\ tfrac {1} {1+ \ mathrm {e} ^ {- x}}}}$ (x ^ 0x80) >> 2

Standard positions

IEEE ‑ 754 formats exist for the bit numbers 16, 32, 64, 128 and 256. Appropriate choices of approximate their value ranges. ${\ displaystyle E _ {\ text {max}}}$

Bits	Exp. (IEEE)	Range of values (IEEE)	${\ displaystyle E _ {\ text {max}}}$ (Posit)	Range of values (posit)	Quire bits (posit)
${\ displaystyle 8}$	-	-	${\ displaystyle 0}$	${\ displaystyle 2 \ cdot 10 ^ {- 4}}$ to ${\ displaystyle 4 \ cdot 10 ^ {4}}$ 0	n / A.
${\ displaystyle 16}$	${\ displaystyle 5}$	${\ displaystyle 6 \ cdot 10 ^ {- 8}}$ to ${\ displaystyle 7 \ cdot 10 ^ {+ 4}}$	${\ displaystyle 1}$	${\ displaystyle 4 \ cdot 10 ^ {- 9}}$ to ${\ displaystyle 3 \ cdot 10 ^ {+ 8}}$	${\ displaystyle 128 = 2 ^ {7}}$ 0
${\ displaystyle 32}$	${\ displaystyle 8}$	${\ displaystyle 1 \ cdot 10 ^ {- 45}}$ to ${\ displaystyle 3 \ cdot 10 ^ {+ 38}}$	${\ displaystyle 3}$	${\ displaystyle 6 \ cdot 10 ^ {- 73}}$ to ${\ displaystyle 2 \ cdot 10 ^ {+ 72}}$	${\ displaystyle 1024 = 2 ^ {10}}$
${\ displaystyle 64}$	${\ displaystyle 11}$	${\ displaystyle 5 \ cdot 10 ^ {- 324}}$ to ${\ displaystyle 2 \ cdot 10 ^ {+ 308}}$	${\ displaystyle 4}$	${\ displaystyle 2 \ cdot 10 ^ {- 299}}$ to ${\ displaystyle 4 \ cdot 10 ^ {+ 298}}$	${\ displaystyle 4096 = 2 ^ {12}}$
${\ displaystyle 128}$	${\ displaystyle 15}$	${\ displaystyle 6 \ cdot 10 ^ {- 4966}}$ to ${\ displaystyle 1 \ cdot 10 ^ {+ 4932}}$	${\ displaystyle 7}$	${\ displaystyle 1 \ cdot 10 ^ {- 4855}}$ to ${\ displaystyle 1 \ cdot 10 ^ {+ 4855}}$	${\ displaystyle 65 \, 536 = 2 ^ {16}}$
${\ displaystyle 256}$	${\ displaystyle 19}$	${\ displaystyle 2 \ cdot 10 ^ {- 78,984}}$ to ${\ displaystyle 2 \ cdot 10 ^ {+ 78 \, 913}}$	${\ displaystyle 10}$	${\ displaystyle 2 \ cdot 10 ^ {- 78 \, 297}}$ to ${\ displaystyle 5 \ cdot 10 ^ {+ 78 \, 296}}$	${\ displaystyle 1 \, 048 \, 576 = 2 ^ {20}}$

supporting documents

^ Walter F. Tichy : The End of (Numeric) Error: An interview with John L. Gustafson Archived from the original on July 10, 2016. In: Association for Computing Machinery (ACM) (Ed.): Ubiquity - Information Everywhere . 2016, April 2016, pp. 1–14. doi : 10.1145 / 2913029 . Retrieved July 10, 2016. "JG: The word unum is short for universal number, the same way the word bit is short for binary digit. "
^ ^A ^b ^c John L. Gustafson: A Radical Approach to Computation with Real Numbers . In: Supercomputing Frontiers and Innovations . doi : 10.14529 / jsfi160203 (English, superfri.org [PDF; 2.4 MB ; accessed on January 6, 2020]).
↑ ^a ^b ^c ^d John L. Gustafson, Isaac Yonemoto: Beating Floating Point at its Own Game: Posit Arithmetic. (PDF) June 12, 2017, accessed on December 28, 2019 .
↑ Invoice at WolframAlpha

[Tichy_2016_1-1] Walter F. Tichy : The End of (Numeric) Error: An interview with John L. Gustafson Archived from the original on July 10, 2016. In: Association for Computing Machinery (ACM) (Ed.): Ubiquity - Information Everywhere . 2016, April 2016, pp. 1–14. doi : 10.1145 / 2913029 . Retrieved July 10, 2016. "JG: The word unum is short for universal number, the same way the word bit is short for binary digit. "

[approach-2] A ^b ^c John L. Gustafson: A Radical Approach to Computation with Real Numbers . In: Supercomputing Frontiers and Innovations . doi : 10.14529 / jsfi160203 (English, superfri.org [PDF; 2.4 MB ; accessed on January 6, 2020]).

[Posits-3] John L. Gustafson, Isaac Yonemoto: Beating Floating Point at its Own Game: Posit Arithmetic. (PDF) June 12, 2017, accessed on December 28, 2019 .

[Beispiel_WolframAlpha-4] Invoice at WolframAlpha