Floating point numbers in digital audio applications

In digital audio applications, floating point numbers are used primarily as 32-bit or 64-bit values in mastering, computed either natively on the computer's CPU or on the DSP of an internal or external sound card. Most modern HD recording systems such as Nuendo, Logic, Samplitude, SADiE or Pyramix work with floating point arithmetic. Audio editors such as Adobe Audition or Pro Tools also allow audio files to be exported in 32-bit floating point coding for further mastering. Adobe Audition (in the current version with a 64-bit software architecture) works internally with 32-bit floating point resolution, so that exporting and re-importing 32-bit floating point audio files does not generate any additional quantization noise, which would otherwise increase with every conversion of the amplitude resolution. The bit depth of the software architecture is therefore independent of the bit depth of the floating point arithmetic used. Pro Tools (which worked with 48-bit fixed-point arithmetic up to version 10) and REAPER, on the other hand, work with 64-bit floating-point numbers in their current versions.

This bit depth is primarily relevant for the accuracy of the variables within the program. For further calculation these variables are passed to the ALU or the internal FPU of the CPU, where they are converted to the register size, for example 256 bits on an AMD Bulldozer. Even the Intel x87 coprocessors of the 1980s calculated floating point numbers in an IEEE-compliant manner with up to 80 bits (extended precision). Only at the end of the calculations is the result rounded to the bit depth used by the software. Depending on the software and hardware configuration, effect plugins are often calculated by the DSP of an internal sound card or even by external DSP servers in order to relieve the CPU.

While in fixed-point arithmetic the maximum level is 0 dBFS and any further increase in level leads to clipping of the amplitude, the floating point representation provides headroom. The dynamic range of an audio signal represented in floating point numbers is therefore divided into two ranges, $\leq$ 0 dBFS and $>$ 0 dBFS. The maximum resolution, however, is limited by the mantissa of the floating point number. A 32-bit floating point arithmetic according to IEEE 754 therefore has a maximum resolution of 23 bits per half-wave, or 24 bits including the sign bit for the entire positive and negative modulation range. In contrast, 32-bit fixed-point arithmetic resolves a half-wave with 31 bits after deducting the sign bit, i.e. the entire positive and negative modulation range with 32 bits. Only 64-bit floating point arithmetic (double precision) with its 52-bit mantissa (i.e. 53 bits of resolution including the sign bit for the entire positive and negative modulation range) is completely superior to 32-bit fixed-point arithmetic.
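The difference between clipping and headroom can be illustrated with a minimal sketch. The function names, the 16-bit word length and the sample values are assumptions chosen for illustration; they are not taken from any of the programs mentioned above.

```python
import struct

INT16_MAX = 32767
INT16_MIN = -32768


def to_float32(x: float) -> float:
    """Round a Python float (64 bit) to the nearest 32 bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mix_fixed_point(a: int, b: int) -> int:
    """Sum two 16 bit samples and clip the result at full scale (0 dBFS)."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def mix_float(a: float, b: float) -> float:
    """Sum two float samples; values above 1.0 are kept as headroom."""
    return to_float32(a + b)


# Two samples at 0.75 of full scale (about -2.5 dBFS each).
fixed_sum = mix_fixed_point(24576, 24576)  # exceeds full scale and is clipped
float_sum = mix_float(0.75, 0.75)          # 1.5, i.e. above 0 dBFS

# Halving the level afterwards only restores the float mix.
fixed_after_gain = fixed_sum // 2
float_after_gain = to_float32(float_sum * 0.5)
```

In the fixed-point path the information above full scale is lost at the moment of summation, whereas the floating point path keeps it until the level is reduced again.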

A decimal fixed point number can always be represented by several floating point numbers, the decimal number 0.375 for example by $2^{0}$ · 0.375 (with 0 as exponent and 0.375 as mantissa). If the mantissa m were normalized to a range of 0.5 $\leq m<$ 1, for example, the result would be $2^{-1}$ · 0.75. The IEEE 754 standard, which is binding for the audio sector, provides for a normalization of the mantissa to the range $1\leq m<2$, so that the value 0.375 is represented unambiguously by the floating point number $2^{-2}$ · 1.5. Since a mantissa normalized in this way always has a 1 before the binary point, this 1 is not stored in the binary representation of the floating point number but is implied as a hidden bit. Values below the smallest value that can be represented with a normalized mantissa are denormalized according to IEEE 754 and represented with an implicit 0 before the binary point of the mantissa. However, these denormalized floating point numbers lead to a 30-fold slowdown in computing speed on various FPUs and are therefore not used in the audio sector: the extreme dynamic range below −144 dBFS is not usable, as it is masked by far by the smallest usable audio signal in the audible frequency range $<$ 20 kHz, whose size is determined by the noise level.
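The normalization and the hidden bit can be made visible by taking a 32-bit floating point number apart. The following is a sketch using only the Python standard library; the function name `decompose_float32` is hypothetical, and infinities and NaN values (biased exponent 255) are deliberately not handled.

```python
import struct


def decompose_float32(x: float):
    """Return (sign, exponent, mantissa) of the float32 encoding of x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF  # the 23 stored mantissa bits
    if biased_exponent == 0:
        # Denormalized number (or zero): implicit leading 0, exponent -126.
        return sign, -126, fraction / 2**23
    # Normalized number: implicit leading 1 (hidden bit), exponent bias 127.
    return sign, biased_exponent - 127, 1 + fraction / 2**23


sign, exponent, mantissa = decompose_float32(0.375)
# 0.375 is stored as 2**-2 * 1.5, so exponent is -2 and mantissa is 1.5.
```

Under these assumptions the value is recovered as `(-1)**sign * mantissa * 2**exponent`.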

32 bit fixed point arithmetic

Since an increase in amplitude by a factor of 10 corresponds to an increase in dynamics of 20 dB, the dynamic range x of a 32-bit fixed point system is calculated using the formula $10^{(x:20)}$ = $2^{32}$; ⇒ x : 20 = lg $2^{32}$; ⇒ x = 20 · lg $2^{32}$ dB = 192.65919722494796493679289262368 dB ≈ 193 dB. In software and DSPs with fixed point arithmetic, two's complement representation is used, with the sign bit as the most significant bit. The range of values is divided into a positive and a negative range, so that only half of the entire range of values, with a resolution of 31 bits, is available for each amplitude or half-wave ($2^{32}$ : 2 = $2^{31}$). In audio technology it is common to relate the dynamic range of an audio system to the entire positive and negative modulation range, whereas the scientific definition of dynamics refers only to the unsigned amplitude (or half-wave). Only half of the total dynamic range is therefore available per amplitude or half-wave (a factor of 2, or a level difference of 6 dB).
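The calculation above can be reproduced numerically. This is a minimal sketch; the function name is an assumption made for illustration.

```python
import math


def fixed_point_dynamic_range_db(bits: int) -> float:
    """Dynamic range in dB of a fixed point word with the given bit count."""
    return 20 * math.log10(2**bits)


full_range_db = fixed_point_dynamic_range_db(32)  # about 192.66 dB
half_wave_db = fixed_point_dynamic_range_db(31)   # about 186.64 dB
difference_db = full_range_db - half_wave_db      # about 6.02 dB
```

The difference between the two results is the level step of roughly 6 dB that the sign bit contributes.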

32 bit floating point arithmetic

With floating point arithmetic, the positive and negative audio signals up to 0 dBFS are by definition (according to Convention Paper 7438 of the Audio Engineering Society) represented in a range from −1 to +1, with the values −1 and +1 representing the maximum level of 0 dBFS. Values beyond this represent the headroom. Since the total dynamic range represented doubles each time the exponent of the floating point number increases by 1, the maximum theoretical dynamic range of 32-bit floating point arithmetic is calculated from the $2^{8}-2$ = 254 available exponents (excluding the exponent reserved for 0 and the exponent $2^{128}$ reserved for infinity) as 254 · 20 · $\lg$ 2 dB ≈ 1529 dB. The negative range is represented with the exponents down to −126, the positive range with the exponents $\geq$ 0. A change in amplitude by a factor of 2 (corresponding to ±1 bit of resolution) results in a level change L of 6 dB, calculated from the amplitude ratio v according to the formula L = 20 · $\lg$ v dB = 20 · $\lg$ (2 : 1) dB = 20 · $\lg$ 2 dB = 6.0205999132796239042747778944899 dB ≈ 6 dB. Theoretically, the dynamic range of the 32-bit floating point system would thus exceed that of a 32-bit fixed point system by (1529 dB − 193 dB) : 6 dB = 222 doublings (calculated with the unrounded values). This higher dynamic range, however, is far from fully usable because of the exponentially increasing scatter of the amplitude resolution and the associated rounding errors. Values $>$ 2 (i.e. $>$ +6 dBFS) already result in a halving of the resolution, while values $\leq$ 1 (i.e. $\leq$ 0 dBFS) are consistently represented with the 23-bit amplitude resolution resulting from the mantissa.
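The figures in this paragraph follow from a few lines of arithmetic. The sketch below only recomputes them; the variable names are assumptions made for illustration.

```python
import math

DB_PER_DOUBLING = 20 * math.log10(2)  # about 6.02 dB per factor of 2

usable_exponents = 2**8 - 2                                  # 254
float32_theoretical_db = usable_exponents * DB_PER_DOUBLING  # about 1529 dB
fixed32_db = 32 * DB_PER_DOUBLING                            # about 193 dB

# Difference between both systems, expressed in steps of about 6 dB.
extra_doublings = (float32_theoretical_db - fixed32_db) / DB_PER_DOUBLING
```

With the unrounded values the last line yields 222, which is simply 254 − 32.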

Quantization noise ratio

While the signal-to-noise ratio of fixed-point arithmetic falls in direct proportion to the level of the audio signal, so that the share of quantization noise increases as the level decreases, the signal-to-noise ratio ${\text{SNR}}$ of floating point arithmetic is determined by the power of the quantization noise of the mantissa, independently of the exponent, across all levels represented with a normalized mantissa. When an integer value from a WAV file or an audio signal from the AD converter is resolved using the 23-bit mantissa, the quantization or rounding error $q$ can amount to at most one quantization interval $q$, i.e. $q_{max}=1:2^{23}$ = $2^{-23}$. The noise power is calculated from the square of the effective value $\sigma$ (sigma) of the amplitude (or a half-wave):

 $\sigma ^{2}={\frac {q^{2}}{12}}$  and  ${\text{SNR}}=10\;\lg \left({\frac {1}{\sigma ^{2}}}\right)\mathrm {dB}$ ; ⇒  ${\text{SNR}}=10\;\lg \left({\dfrac {1}{\dfrac {q^{2}}{12}}}\right)\mathrm {dB}$ ; ⇒  ${\text{SNR}}=10\;\lg \left({\frac {12}{q^{2}}}\right)\mathrm {dB}$ ; ⇒  ${\text{SNR}}=10\;\lg \left({\frac {12}{(2^{-23})^{2}}}\right)\mathrm {dB}$  ≈ 149  $\mathrm {dB}$ ;

If the less favorable peak value of the quantization noise is taken as the basis for very small values of the mantissa, the signal-to-noise ratio of the amplitude (or a half-wave) is calculated as follows:

 ${\text{SNR}}=10\;\lg \left({\frac {1}{q^{2}}}\right)\mathrm {dB}$ ; ⇒  ${\text{SNR}}=10\;\lg \left({\frac {1}{(2^{-23})^{2}}}\right)\mathrm {dB}$  ≈ 138  $\mathrm {dB}$ ;

The quantization noise level can thus be plotted as a sawtooth diagram, since within the range covered by one exponent the signal-to-quantization-noise ratio drops in direct proportion to the input signal (or the integer value to be represented) from its ideal value of 149 dB to 138 dB in the worst case. Since a sample describes either a positive or a negative half-wave and the quantization error in a floating point number can occur only once, the signal-to-quantization-noise ratio over the negative and positive modulation range of the audio signal increases by a further factor of 2 (corresponding to +6 dB), so that the total usable dynamic range $q_{ges}$ $\leq$ 0 dBFS fluctuates between 155 dB and (in the worst case) 144 dB:

138  $\mathrm {dB}$  + 6  $\mathrm {dB} \leq q_{ges}\leq$  149  $\mathrm {dB}$  + 6  $\mathrm {dB}$ ; ⇒ 144  $\mathrm {dB} \leq q_{ges}\leq$  155  $\mathrm {dB}$
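The two signal-to-noise ratios and the resulting bounds can be checked with a short calculation. This sketch follows the formulas of this section, with the signal power normalized to 1; the names are assumptions made for illustration.

```python
import math


def snr_db(noise_power: float) -> float:
    """Signal-to-noise ratio in dB for a signal power normalized to 1."""
    return 10 * math.log10(1 / noise_power)


q = 2.0**-23                      # one quantization interval of the mantissa

snr_rms = snr_db(q * q / 12)      # noise power from sigma^2 = q^2 / 12
snr_peak = snr_db(q * q)          # worst case, peak value of the noise

sign_bit_gain = 20 * math.log10(2)        # about 6 dB for both half-waves
total_low = snr_peak + sign_bit_gain      # about 144 dB
total_high = snr_rms + sign_bit_gain      # about 155 dB
```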

Standard dynamic range ≤ 0 dBFS

If a file in fixed point representation is imported into a system with floating point arithmetic, the positive and negative audio signals are algebraically normalized to the value range from −1 to +1. The binary fixed point number with the word length w and the individual bits b (each with the value 0 or 1) is evaluated as the sum $-b_{0}$ (sign bit) + $b_{1}$ · 0.5 + $b_{2}$ · 0.25 + $b_{3}$ · 0.125 + ... + $b_{w-1}$ · $2^{-(w-1)}$. This algebraic normalization means that the value of the fixed-point number is fitted linearly into the floating point value range from 0 to 1 (not to be confused with volume normalization, which in floating point arithmetic would likewise be carried out to the maximum value of −1 or +1, corresponding to $\leq$ 0 dBFS). The individual steps of the integer values of the fixed-point number then correspond to steps of the floating point number of size 1 : $2^{23}$ ≈ 1 · $10^{-7}$, which extend evenly over the partial value ranges that are logarithmically graded by the exponents:

Scatter $\Delta$ of the amplitude resolution with 32 bit floating point arithmetic
1.175 · $10^{-38}$ ≈ $2^{-126}$ · 1 (smallest normalized number)	...	0.25 − 0.00000001 = $2^{-3}$ · 1.99999999	0.5 − 0.00000003 ≈ $2^{-2}$ · 1.9999999	1 − 0.00000006 = $2^{-1}$ · 1.99999999	2 − 0.0000001 = $2^{0}$ · 1.99999999
1.1754945 · $10^{-38}$ = $2^{-126}$ · 1.0000001	...	0.25 = $2^{-2}$ · 1	0.5 = $2^{-1}$ · 1	1 = $2^{0}$ · 1 (0 dBFS)	2 = $2^{1}$ · 1
1.1754946 · $10^{-38}$ = $2^{-126}$ · 1.0000002	...	0.25 + 0.00000003 = $2^{-2}$ · 1.0000001	0.5 + 0.00000006 ≈ $2^{-1}$ · 1.0000001	1 + 0.0000001 ≈ $2^{0}$ · 1.0000001	2 + 0.0000002 ≈ $2^{1}$ · 1.0000001
$\geq$ −144 dBFS	...	$\geq$ −12 dBFS	$\geq$ −6 dBFS	$\leq$ 0 dBFS	$\leq$ +6 dBFS
$\Delta$ min. = 1.5 · $10^{-45}$	...	$\Delta \geq 2^{-2}$ = 3 · $10^{-8}$	$\Delta \geq 2^{-1}$ = 6 · $10^{-8}$	$\Delta \leq 2^{0}$ = 1 · $10^{-7}$	$\Delta \leq 2^{1}$ = 1 · $10^{-7}$

4 − 0.0000002 ≈ $2^{1}$ · 1.9999999	...	$2^{24}$ − 1 ≈ $2^{23}$ · 1.9999999	$2^{25}$ − 2 ≈ $2^{24}$ · 1.9999999	$2^{127}$ − $10^{31}$ ≈ $2^{126}$ · 1.9999999	3.4028235 · $10^{38}$ − 2 · $10^{31}$ ≈ $2^{127}$ · 1.9999998
4 = $2^{2}$ · 1	...	$2^{24}$ = $2^{24}$ · 1	$2^{25}$ = $2^{25}$ · 1	$2^{127}$ = $2^{127}$ · 1	3.4028235 · $10^{38}$ ≈ $2^{127}$ · 1.99999999
4 + 0.0000004 ≈ $2^{2}$ · 1.0000001	...	$2^{24}$ + 2 ≈ $2^{24}$ · 1.0000001	$2^{25}$ + 4 ≈ $2^{25}$ · 1.0000001	$2^{127}$ + 2 · $10^{31}$ ≈ $2^{127}$ · 1.0000001
$\Delta \leq 2^{2}$ = 2 · $10^{-7}$	...	$\Delta \leq 2^{24}$ = 1	$\Delta \leq 2^{25}$ = 2	$\Delta \leq 2^{127}$ = $10^{31}$	$\Delta$ $<2^{128}$ = 2 · $10^{31}$
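The algebraic normalization described in this section can be sketched as follows. Both function names are hypothetical, and dividing by $2^{w-1}$ is only one common scaling convention, assumed here because it matches the bit weights given above; with it, the value +1 itself is not reached by the largest positive integer.

```python
def fixed_to_float(bits: str) -> float:
    """Evaluate a two's complement bit string b0 b1 ... b(w-1) as a fraction."""
    value = -float(int(bits[0]))            # sign bit b0 carries the weight -1
    for i, bit in enumerate(bits[1:], start=1):
        value += int(bit) * 2.0**-i         # b1 * 0.5 + b2 * 0.25 + ...
    return value


def int_sample_to_float(sample: int, word_length: int) -> float:
    """Scale a signed integer sample of the given word length to [-1, 1)."""
    return sample / 2 ** (word_length - 1)


three_quarters = fixed_to_float("0110")           # 0.5 + 0.25
minus_one = fixed_to_float("1000")                # sign bit only
half_scale = int_sample_to_float(16384, 16)       # 0.5
```

Both functions describe the same mapping: the first evaluates the individual bits, the second applies the equivalent division to the integer value.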

Headroom > 0 dBFS

Each partial value range coarsely selected by the respective exponent is always finely resolved on its own with the 23-bit mantissa (i.e. with 23 bits of resolution). With every increase of the exponent by 1, however, this resolution is spread over an ever larger partial value range, so that the resolution keeps decreasing relative to that of the smallest representable value range. Beyond a certain range, the steps between representable values are as large as the entire standard dynamic range up to 0 dBFS. For example, while the range of values from $1$ to $2^{1}$ used for the headroom (represented by the floating point numbers $2^{0}$ · 1.0 to $2^{1}$ · 1.0) still has an amplitude resolution of $1$ : $2^{23}$ (i.e. the full resolution of the 23-bit mantissa), a theoretical amplitude value $>2^{23}$ already scatters by $\Delta$ = 1, i.e. by the entire dynamic range $\leq$ 0 dBFS, corresponding to a resolution of 1 : 1. When calculating the overall dynamics or the overall resolution, it must be borne in mind that the amplitude resolution refers only to the unsigned value of the amplitude (a half-wave).
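How the step size grows with the exponent can be measured directly on the bit patterns. The sketch below assumes a positive, finite input below the largest 32-bit floating point number; the function name `float32_spacing` is hypothetical.

```python
import struct


def float32_spacing(x: float) -> float:
    """Distance from x (rounded to float32) to the next larger float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lower = struct.unpack("<f", struct.pack("<I", bits))[0]
    # For positive values, the next bit pattern is the next larger number.
    upper = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return upper - lower


step_at_full_scale = float32_spacing(1.0)    # 2**-23
step_at_plus_6_db = float32_spacing(2.0)     # 2**-22, twice as coarse
step_at_2_pow_23 = float32_spacing(2.0**23)  # 1.0
```

At $2^{23}$ the step size has reached 1, which is the case discussed in the paragraph above.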

The maximum dynamic range of the headroom that can be used on its own (i.e. not in relation to the standard dynamic range) without loss of resolution or significant rounding errors, over the entire negative and positive modulation range, is calculated from the number of bits of the amplitude resolution plus the sign bit and from the amplitude ratio v of the value to be represented to the smallest possible quantization unit: |20 · lg ($1:2^{24}$)| dB = |20 · lg $2^{-24}$| dB = 144.49439791871097370259466946776 dB ≈ 144 dB. Since the standard dynamic range, given its smallest signal-to-noise ratio, also comprises 144 dB, the dynamic range is doubled; on the logarithmic dB scale (again calculated from the amplitude ratio v) this means an increase of 6 dB to a total of 150 dB.

It should be noted that the 1 in this quotient refers to the (integer) value to be represented, not to the floating point number corresponding to 0 dBFS, which only happens to have the same value 1 here. When calculating the total dynamic range that can be used without loss of resolution, the floating point number 2 corresponding to +6 dBFS cannot simply be used as the dividend; otherwise the linear scale of the quantization units would be mixed inadmissibly with the scale of the floating point numbers, which spreads further towards higher values. Rather, the divisor has to be doubled once more to account for the doubling of the dynamic range, so that the total dynamic range that can be used without loss of resolution can also be calculated as |20 · lg ($1:2^{25}$)| dB = |20 · lg $2^{-25}$| dB = 150.51499783199059760686944736225 dB ≈ 150 dB.

The headroom of 32-bit floating point arithmetic that can be used without loss of resolution is thus 6 dB and allows the summation of $10^{6:10}$ ≈ 4, i.e. only four incoherent mono channels (different audio signals in mono), each at full modulation of 0 dBFS.
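The headroom figures can be recomputed as follows. This is a sketch of the arithmetic only; the function names are assumptions, and the channel count uses the power summation of incoherent signals stated above.

```python
import math


def level_db(amplitude_ratio: float) -> float:
    """Level difference in dB for a given amplitude ratio."""
    return 20 * math.log10(amplitude_ratio)


def incoherent_channels(headroom_db: float) -> float:
    """Number of equally loud incoherent channels that fit into the headroom."""
    return 10 ** (headroom_db / 10)


standard_range_db = level_db(2**24)                     # about 144.5 dB
with_headroom_db = level_db(2**25)                      # about 150.5 dB
headroom_db = with_headroom_db - standard_range_db      # about 6 dB
channels = incoherent_channels(6)                       # about 4
```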

Amplitude resolution with 32 bit floating point arithmetic (x = amplitude resolution; y = value range)
x in bit	23	23	23	23	23	23	22	21	20
y as a power of ten	$>$ 1.17549435 · $10^{-38}$	...	$>$ 0.125	$>$ 0.25	$>$ 0.5	$>$ 1	$>$ 2	$>$ 4	$>$ 8
y as a power of two	$>2^{-126}$	...	$>2^{-3}$	$>2^{-2}$	$>2^{-1}$	$>2^{0}$	$>2^{1}$	$>2^{2}$	$>2^{3}$
Overall dynamics	$>$ −144 dBFS	...	$>$ −18 dBFS	$>$ −12 dBFS	$>$ −6 dBFS	$>$ 0 dBFS	$>$ +6 dBFS
Amplitude resolution	$1:2^{23}$	$1:2^{23}$	$1:2^{23}$	$1:2^{23}$	$1:2^{23}$	$1:2^{23}$	$1:2^{22}$	$1:2^{21}$	$1:2^{20}$
Scatter $\Delta$	1.5 · $10^{-45}$	...	0.1 · $10^{-7}$	0.3 · $10^{-7}$	0.6 · $10^{-7}$	1 · $10^{-7}$	2 · $10^{-7}$	5 · $10^{-7}$	10 · $10^{-7}$

x in bit	19	18	17	16	...	2	1	0	−1
y as a power of ten	$>$ 16	$>$ 32	$>$ 64	$>$ 128	...	$>$ 2.1 · $10^{6}$	$>$ 4.19 · $10^{6}$	$>$ 8.39 · $10^{6}$	$>$ 1.67 · $10^{7}$
y as a power of two	$>2^{4}$	$>2^{5}$	$>2^{6}$	$>2^{7}$	...	$>2^{21}$	$>2^{22}$	$>2^{23}$	$>2^{24}$
Amplitude resolution	$1:2^{19}$	$1:2^{18}$	$1:2^{17}$	$1:2^{16}$	...	$1:2^{2}$	$1:2^{1}$	$1:2^{0}$ (= 1 : 1)	$1:2^{-1}$ (= 1 : 0.5)
Scatter $\Delta$	19 · $10^{-7}$	38 · $10^{-7}$	76 · $10^{-7}$	153 · $10^{-7}$	...	0.25 (= 2,500,000 · $10^{-7}$)	0.5 (= 5,000,000 · $10^{-7}$)	1 (= 10,000,000 · $10^{-7}$)	2

Scatter Δ:

$\Delta =\vert {\text{partial value range}}\vert :2^{23}$

Amplitude resolution x:

${\begin{aligned}2^{x}&=\left({\frac {2^{23}}{\Delta \cdot 10^{7}}}\right)\rightarrow x=\log _{2}\left({\frac {2^{23}}{\Delta \cdot 10^{7}}}\right)\\&\rightarrow x=\log _{2}\left({\dfrac {2^{23}}{\dfrac {\vert {\text{partial value range}}\vert \cdot 10^{7}}{2^{23}}}}\right)\rightarrow x=\log _{2}\left({\frac {2^{46}}{\vert {\text{partial value range}}\vert \cdot 10^{7}}}\right)\end{aligned}}$
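The two formulas can be evaluated with the following sketch; the function names are assumptions. Because the formula uses $10^{7}$ as a round approximation of $2^{23}$, the result is not an integer, and it is assumed here that the table values are obtained by rounding.

```python
import math

MANTISSA_STEPS = 2**23  # number of steps of the 23 bit mantissa


def scatter(partial_value_range: float) -> float:
    """Scatter: size of the partial value range divided by 2**23."""
    return abs(partial_value_range) / MANTISSA_STEPS


def amplitude_resolution(partial_value_range: float) -> float:
    """Amplitude resolution x in bits according to the formula above."""
    delta = scatter(partial_value_range)
    return math.log2(MANTISSA_STEPS / (delta * 10**7))


x_full_scale = round(amplitude_resolution(1.0))     # range 1 to 2
x_plus_6_db = round(amplitude_resolution(2.0))      # range 2 to 4
x_large = round(amplitude_resolution(2.0**23))      # range above 2**23
```

Rounded in this way, the results agree with the values 23, 22 and 0 listed in the table for these ranges.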

Computer interfaces

In March 2001, Apple introduced Mac OS X, a fundamentally revised version of its operating system that includes Core Audio, an audio subsystem that is internally based on floating point numbers. Only at the interface to the connected hardware components or when exporting to files can the data optionally be converted to integer format. Since version 10.3, the Audio Units extensions can be used to carry out this conversion. For this purpose, a corresponding converter object only needs to be instantiated and inserted at the right place in the processing chain.

DirectSound, Microsoft's own interface for sound processing, on the other hand works only with integers.

ASIO, an interface developed by Steinberg Media Technologies (later part of Pinnacle Systems and now of Yamaha) and designed specifically for low latency, is intended for connecting sound-producing devices. It supports sample rates from 32 kHz to 192 kHz with word widths of 16, 24 and 32 bits in integer notation as well as 32-bit and 64-bit floating point values.

Plugins programmed for the RTAS interface of Pro Tools or for Steinberg's VST environment, or implemented as Audio Units for generic use on the Mac and in Logic, are also addressed with floating point values.

The DXi interface, developed by Cakewalk (Twelve Tone Systems) for its Sonar and HomeStudio products, is the exception: only integer arithmetic may be used.

Most current programs that process audio data now use floating point numbers for internal data processing, since most signal sources would quickly exceed the headroom of the digital mixer in integer arithmetic. Several manufacturers advertise the use of floating point numbers as the internal data format of their applications.

In this context, Pro Tools plays a special role with the external DSP racks of the TDM systems from Digidesign, since the Motorola DSPs used there support only integer arithmetic.
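Wherever a floating point signal path meets an interface that accepts only integers, the samples have to be limited to full scale and rescaled. The following is a generic sketch of such a conversion and not the API of any of the interfaces named in this section; the function name and the scaling factor 32767 are assumptions, and other scaling conventions are in use.

```python
def float_to_int16(sample: float) -> int:
    """Convert a float sample to a 16 bit integer, clipping the headroom."""
    clipped = max(-1.0, min(1.0, sample))   # anything beyond 0 dBFS is lost
    return int(round(clipped * 32767))      # assumed symmetric scaling


block = [0.0, 0.5, 1.0, 1.5, -2.0]
converted = [float_to_int16(s) for s in block]
```

Values in the headroom above 0 dBFS cannot be carried across such an interface, which is why the level has to be reduced before the conversion.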

File formats

Floating point numbers are also used in some data formats for storing multimedia information in order to avoid conversion errors that could otherwise occur during encoding and decoding. The following table gives an overview of frequently used audio formats without data reduction and shows the number formats in which they can store audio information.

Format	8-bit Int	16-bit Int	24-bit Int	32-bit Int	32-bit FP	64-bit FP
Microsoft WAV format (little endian)	x	x	x	x	x	x
Apple / SGI AIFF format (big endian)	x	x	x	x	x	x
Sun / NeXT AU format (big endian)	x	x	x	x	x	x
RAW PCM data	x	x	x	x	x	x
Ensoniq PARIS file format	x	x	x	x
Amiga IFF / SVX8 / SV16 format	x	x
Sphere NIST format	x	x	x	x
VOC files		x
Berkeley / IRCAM / CARL		x	x	x	x
Sonic Foundry's 64 bit RIFF / WAV		x	x	x	x	x
Matlab (tm) / GNU Octave		x		x	x	x
Portable Voice Format	x	x		x
Audio Visual Research	x	x
MS WAVE with WAVEFORMATEX		x	x	x	x	x

(FP = floating point, Int = integer)
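As an illustration of the 32-bit FP column for the WAV format, the following sketch builds a minimal little-endian WAV file in memory. It assumes the format tag 3 (IEEE float) of the RIFF/WAVE format together with an extended `fmt ` chunk and a `fact` chunk; the function name and the default parameters are hypothetical.

```python
import struct

WAVE_FORMAT_IEEE_FLOAT = 3  # format tag for floating point samples


def float32_wav_bytes(samples, sample_rate=48000, channels=1) -> bytes:
    """Build a minimal WAV file holding interleaved 32 bit float samples."""
    data = struct.pack("<%df" % len(samples), *samples)
    block_align = channels * 4  # bytes per sample frame
    fmt = struct.pack(
        "<HHIIHHH",
        WAVE_FORMAT_IEEE_FLOAT, channels, sample_rate,
        sample_rate * block_align, block_align,
        32,  # bits per sample
        0,   # size of the format extension (none)
    )
    fact = struct.pack("<I", len(samples) // channels)  # frames per channel
    chunks = (
        b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"fact" + struct.pack("<I", len(fact)) + fact
        + b"data" + struct.pack("<I", len(data)) + data
    )
    return b"RIFF" + struct.pack("<I", 4 + len(chunks)) + b"WAVE" + chunks


# The value 1.5 lies in the headroom and is stored without clipping.
blob = float32_wav_bytes([0.0, 0.5, -1.0, 1.5])
```

The bytes can be written to disk with `open(path, "wb").write(blob)`; whether a given program accepts such a file depends on its importer.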

In the case of data-reduced formats, the analysis is somewhat more difficult, as these are based on container formats that usually allow all notations internally. With Apple's QuickTime codecs, for example, it is possible to store floating point numbers directly. However, there is still no medium for end users on which samples are stored as floating point numbers in a standardized form.

Literature

Udo Zölzer: Audio processing systems. (PDF; 117 kB) Springer Verlag, accessed on June 10, 2016 (Chapter 4).
Stefan Weinzierl: Kommunikationstechnik II. (PDF; 2 MB) TU Berlin, p. 33 , accessed on June 10, 2016 (Section 2.8 Number representation and number format).

References

SoundGrid Servers. Waves Inc., accessed June 22, 2016.
Marcel Beuler: Realization of arithmetic assemblies for the 32-bit floating point format of the ANSI/IEEE 754 standard using VHDL. (PDF; 548 kB) FH Gießen-Friedberg, April 2008, p. 3 (p. 7 of the PDF), accessed June 28, 2016.
Jonas Ekeroot: Audio Software Development. (PDF; 630 kB) August 9, 2007, p. 26, accessed June 24, 2016 (English).
Uwe Martens: High Resolution Audio - Audio Analysis. July 1, 2015, accessed June 23, 2016 (see line 8: Noise floor 32 bit floating point).
Introduction to digital signal processing. (PDF; 35.1 MB) ELV Elektronik AG, p. 34, accessed June 10, 2016.
Windows Sysinternals - Processing the Audio Data. Microsoft, accessed June 11, 2016 (English): "16-bit audio is represented by a signed integer with a range from -32768 to 32767"
Michael Talbot-Smith: Audio Engineer's Reference Book. P. 103, accessed June 16, 2016 (English).
AES Convention Paper 7438 - Audio software development. (PDF; 303 kB) Audio Engineering Society, May 2008, p. 7, accessed June 16, 2016 (English).
Udo Zölzer: Digital Audio Signal Processing. P. 55, accessed June 16, 2016 (English): "dynamic range for floating-point representation"
Floating point unit demonstration on STM32 microcontrollers. (PDF; 787 kB) STMicroelectronics, May 2016, p. 6, accessed June 22, 2016 (English).
The decibel. Detlef Mietke, accessed June 12, 2016.
Calculate: Gain and loss as a factor in the level in decibels (dB). Eberhard Sengpiel, accessed June 12, 2016.
Men Muheim: Design and Implementation of a Commodity Audio System. (PDF; 7.5 MB) 2003, p. 53, accessed June 24, 2016 (English, doctoral thesis).
Udo Zölzer: Digital Audio Signal Processing. P. 56, accessed June 16, 2016 (English): "the signal-to-noise ratio is independent of the level of the input"
Jonas Ekeroot: Audio Software Development. (PDF; 630 kB) August 9, 2007, p. 29, accessed June 24, 2016 (English).
Udo Zölzer: Digital Audio Signal Processing. P. 49, accessed June 16, 2016 (English, decimal evaluation of the bits).
W. Kahan: Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic. (PDF; 115 kB) October 1, 1997, accessed June 12, 2016 (English).
IEEE 754 Converter. Harald Schmidt, accessed June 5, 2016 (online calculation script for 32 bit floating point calculation).
Decibel (dB) to Float Value Calculator. Play Dot Sound, accessed June 16, 2016.
Sum level of incoherent sound sources. Alexander Sengpiel, accessed June 16, 2016.
