Arithmetic coding

The arithmetic coding is a form of entropy coding which in the lossless data compression is used and achieves compression rates which very close to the theoretical limit of the entropy lie. The founder of arithmetic coding is Jorma Rissanen , who from 1976 to the beginning of the 1980s did significant work in this branch of information theory.

Conventional encodings are based on the representation of the information with a fixed number of integer bits, for example in ASCII code with 7 bits, or frequently used characters are stored with fewer bits and less frequent characters are stored with more bits, as is shown in the Huffman coding is the case. The arithmetic coding differs from this in the essential point that the source information is not divided into individual components, but the entire source information or a longer part of it is represented as a rational number . The mapping is usually in the form of an arbitrarily precise fraction q consisting of two natural numbers with the interval limits 0.0 ≤ q <1.0. There are also variations of the arithmetic coding for a shorter calculation time, which only use a single natural number of any length to represent information. In general, arithmetic coding is more computationally intensive than conventional methods which form code words with a number of integer bits.

Basic principle

Theoretically, the method works with infinitely precise real numbers , even if the actual implementations then have to fall back on finitely precise integer , fixed point or floating point numbers . However, this always means that rounding has to be carried out and the result is no longer optimal.

The inputs for the arithmetic encoder are referred to below as a symbol . The output to the encoder is a real number (here denoted by x ).

First, the encoder and decoder must agree on an interval in which the number x should be located. Usually the range between 0 and 1 (exclusive) is used here, i.e. the half-open interval (see). ${\ displaystyle [0,1) = \ {x \ in \ mathbb {R} \, | \, 0 \ leq x <1 \}}$

In addition, when decoding or encoding a character, encoders and decoders must always have identical tables available with the probabilities of all possible decodable characters. The model is responsible for providing these probabilities .

One possibility is to create a frequency analysis especially for the input data before coding and to communicate this to the decoder in addition to the actual message. Encoders and decoders then use this table for all characters.

Encoder

Initialize the current interval with the agreed start interval - as mentioned above, this is usually the interval . ${\ displaystyle [0,1)}$
Divide the current interval into sub- intervals , with each encodable character being assigned a sub-interval. The size of the sub-intervals is determined by the frequency of occurrence of the character ( probability of occurrence ). The order of the intervals is determined by an agreement (e.g. in alphabetical order).

The subinterval that corresponds to the next character in the entry becomes the current interval .

If there are still more characters to be coded, then continue with point 2. Otherwise continue with the next point.

Output any number from the current interval plus the number of encoded characters. This is the number x that can be decoded by the decoder as described above. This number is selected in such a way that it has as few decimal places as possible, ie is as "round" as possible and can therefore be represented with relatively few bits .

Decoder

The decoder can now decode one character at a time by doing the following:

Initialize the current interval with the agreed interval - usually the interval .

{\ displaystyle [0,1)}

Split the current interval into sub- intervals in the same way as the encoder, and assign a character to each sub-interval (see above).

Find out in which of these sub-intervals the number x lies and output the character that is assigned to this sub-interval. In addition, this sub-interval now becomes the current interval .

If there are still more characters to decode, continue with point 2 with the new current interval .

It is noticeable with this algorithm that it does not terminate : The number x alone does not tell when the last character was decoded. The decoder must always be informed by additional information when it has finished its work. This is usually implemented in the form of a length specification, but can also (e.g. if only a single data pass is required for coding) by means of a special character with the meaning “end”.

Interval division

The sub-intervals must be chosen so that the encoder and decoder determine the size and position equally. As mentioned above, the size of the sub-intervals results from the probabilities of the characters.

The order (order) of the intervals, on the other hand, is not important for the quality of the algorithm, so that any order can be specified here. This is the subject of an agreement (e.g. alphabetical order).

One way to calculate the intervals is as follows:

{\ displaystyle {\ begin {aligned} I_ {l} (n + 1) & = I_ {l} (n) + I_ {s} L \\ I_ {h} (n + 1) & = I_ {l} (n) + I_ {s} H \ end {aligned}}}

${\ displaystyle I_ {l}}$ and are the limits of the interval. is the length of the interval, so . The two values and are the sum of the probabilities of all characters with an index less than or equal to the character to be encoded . ${\ displaystyle I_ {h}}$ ${\ displaystyle I_ {s}}$ ${\ displaystyle I_ {h} -I_ {l}}$ ${\ displaystyle L}$ ${\ displaystyle H}$ ${\ displaystyle x}$

{\ displaystyle {\ begin {aligned} L & = \ sum p (i), i <x \\ H & = \ sum p (i), i \ leq x \ end {aligned}}}

If, for example, an alphabet A = {A, B, C} is used, the probabilities of which are p (A) = 0.75, p (B) = 0.125, p (C) = 0.125 and the next character x to be coded is C, then will and calculated as follows: ${\ displaystyle L}$ ${\ displaystyle H}$

{\ displaystyle {\ begin {aligned} L & = p (A) + p (B) = 0.875 \\ H & = p (A) + p (B) + p (C) = 1 \ end {aligned}}}

example

Interval nesting with arithmetic coding

In this example the string "AAABAAAC" is compressed. First, the frequencies of all characters are required for the size of the subintervals. For the sake of simplicity, a static probability is used for all characters.

character	frequency	probability	optimal number of bits
A.	6th	p (A) = 75%	${\ displaystyle I (A) = - \ log _ {2} (A) = - \ log _ {2} (0 {,} 75) = 0 {,} 415}$
B.	1	P (B) = 12.5%	${\ displaystyle I (B) = - \ log _ {2} (B) = - \ log _ {2} (0 {,} 125) = 3}$
C.	1	P (C) = 12.5%	${\ displaystyle I (C) = - \ log _ {2} (C) = - \ log _ {2} (0 {,} 125) = 3}$

The optimal number of bits results from the formula for the entropy . With this value it can be calculated that the information content of the character string corresponds to 8.49 bits.

${\ displaystyle I _ {\ text {ges}} = \ sum _ {i = 1} ^ {3} n_ {i} \ cdot I ({z_ {i}}) = 6 * 0 {,} 415 + 1 * 3 + 1 * 3 = 8 {,} 49}$

Now to the process. The following table shows the exact values for the sub-intervals after coding the individual characters. The graphic on the right illustrates the selection of the sub-intervals again.

	interval		Interval size
	0 -	1	1
A.	0 -	0.75	0.75
A.	0 -	0.5625	0.5625
A.	0 -	0.421875	0.421875
B.	0.31640625 -	0.369140625	0.052734375
A.	0.31640625 -	0.35595703125	0.03955078125
A.	0.31640625 -	0.3460693359375	0.0296630859375
A.	0.31640625 -	0.338653564453125	0.022247314453125
C.	0.335872650146484375 -	0.338653564453125	0.002780914306640625

Any number from the last interval that is as short as possible is saved, e.g. B. 0.336.

This corresponds to between 8 and 9 bits. The Huffman coding , on the other hand, would have required 10 bits for the given string (1 for each A and 2 each for B and C)

The difference in this example is 10%. The profit increases when the number of bits actually used by Huffman coding deviates more from the optimal one, i.e. when a character occurs extremely frequently.

The decoder takes this number to decode. The following takes place:

It starts with the start interval [0; 1]. This interval is divided into three sub-intervals by the decoder (as in the first line of the picture).
0.336 lies in the subinterval A [0; 0.75]. So the first character is A.
The current subinterval is [0; 0.75]
[0; 0.75] is broken down again according to the familiar scheme
0.336 is again in the first interval [0; 0.5625], i.e. output A.
[0; 0.5625] is broken down again according to the familiar scheme
0.336 is again in the first interval [0; 0.421875], i.e. output A.
[0; 0.421875] is broken down again according to the familiar scheme
This time 0.336 is in the second interval [0.31640625; 0.369140625], so B will print.
[0.31640625; 0.369140625] is broken down again according to the familiar scheme
...

Optimality

Arithmetic coding is asymptotically optimal:

After the last symbol has been processed, you get an interval with ${\ displaystyle [r, r + s)}$

{\ displaystyle s = \ prod _ {i = 1} ^ {N} p (x_ {i})}

This corresponds to the probability of receiving exactly such a sequence given the symbol probabilities . ${\ displaystyle p (x_ {i})}$

In order to specify a binary value in the interval , one needs ${\ displaystyle [r, r + s)}$

at least bits, if ${\ displaystyle - \ lceil \ log _ {2} (s) \ rceil = - \ log _ {2} (s)}$ ${\ displaystyle \ exists \; x = 0 \ mod 2 ^ {- \ lceil \ log _ {2} (s) \ rceil}, x \ in [r, r + s)}$
at most, however, bits (to describe the interval with an accuracy of s / 2, which in the binary system is just enough to distinguish whether the value is within). ${\ displaystyle - \ lceil \ log _ {2} (s) \ rceil +1}$

There

{\ displaystyle {\ begin {aligned} - \ log _ {2} (s) & = - \ log _ {2} {\ big (} \ prod _ {i = 1} ^ {N} p (x_ {i }) {\ big)} \\ & = - \ sum _ {i = 1} ^ {N} \ log _ {2} p (x_ {i}) \\ & =: n \ cdot H (X) \ end {aligned}}}

and

{\ displaystyle - \ lceil \ log _ {2} (s) \ rceil +1 <- \ log _ {2} (s) +2}

we can estimate the length of the arithmetically coded sequence: ${\ displaystyle l}$

{\ displaystyle n \ cdot H (X) \ leq l <n \ cdot H (X) +2}

This means that you need at least as many bits as the entropy , but no more than two bits more.

The mean length of a coded symbol is limited to ${\ displaystyle l_ {Sym} = {\ frac {l} {n}}}$

{\ displaystyle H (X) \ leq l_ {Sym} <H (X) + {\ frac {2} {n}}}

For long sequences, these (at most two) additional bits are evenly distributed over all symbols, so that the mean length of a coded symbol then asymptotically goes against the true entropy:

{\ displaystyle \ lim _ {n \ to \ infty} {\ Big (} H (X) + {\ frac {2} {n}} {\ Big)} = H (X)}

Compared to Huffman coding

If all symbol probabilities can be represented in the form , then arithmetic coding and Huffman coding produce an identically long data stream and are equally (i.e. optimally) efficient. In practice, however, this is almost never the case. ${\ displaystyle p_ {i}}$ ${\ displaystyle p_ {i} = 2 ^ {- k_ {i}}, k_ {i} \ in \ mathbb {N ^ {+}}}$

implementation

Since one can not work with infinitely precise real numbers in a concrete implementation , the concrete implementation of the algorithm has to be done somewhat differently. The maximum precision of the numbers is generally fixed (e.g. 32 bits) and cannot exceed this value. This is why an arithmetic coder cannot be implemented on a real computer.

To get around the problem of limited accuracy, two steps are taken:

The places after the upper limit of accuracy are cut off. As a result, the interval sizes no longer exactly match the required values. This leads to a deterioration in the result.
The interval has to be increased from time to time, otherwise the accuracy of the numbers will be used up after a few coded characters. Therefore, higher-value digits that are fixed are output and removed from the numbers. In the example above, after coding the character B, you can safely say that the result number starts with 0.3. So you can already output 0.3 here and subtract it from the interval limits. The interval limit is then scaled by 10 and the calculation continues with this value.

Point 1 actually means that the algorithm is no longer an arithmetic coder, but just similar. There are, however, some independent algorithms that derive from the arithmetic coder; these are:

The range coder: This coder is a relatively straightforward implementation of the arithmetic coder with integers .
The Q-Coder ( developed and patented by IBM ): This also simplifies the alphabet to just two characters. This procedure allows the interval division to be approximated with additions instead of multiplications as with the range coder.
The ELS coder: This also only works with two characters, but is more efficient with characters that are relatively equally probable, while with the Q-Coder both characters should have as different probabilities as possible.

Despite these methods, various problems remain with arithmetic coding:

speed: Arithmetic encoders are relatively expensive and slow. With the range coder, a division must be carried out for each character . The other encoders require the encoding process to be carried out several times for all bits of the character.
Patents: Most of the software patents in the field of arithmetic coding were granted in the 1980s and early 1990s and have now expired. The Q-Coder is permitted alongside the Huffman Coder for JPEG , but is almost never used because it was patented by IBM.
Small profit: Different methods can be used to ensure that the much faster Huffman coding only delivers marginally worse results than the complex arithmetic coder. This includes that sometimes strings are treated as independent characters. This reduces the “waste” that arises from the fact that each character is represented with an integer bit length.

Web links

Elaboration on the basics of arithmetic coding, including source text (E. Bodden et al. 2002; PDF file; 568 kB)
Q-coder site from IBM
Range coder website, source code for download
Description of the range coder's algorithm in pseudocode
Chapter 6 of the electronic book Information Theory, Inference, and Learning Algorithms by David JC MacKay explains arithmetic coding.

Individual evidence

↑ Jorma Rissanen: Generalized Kraft inequality and arithmetic coding. Ed .: IBM Journal of Research and Development. No. 20 . IBM (company publication), 1976, p. 198-203 .
↑ Khalid Sayood: Introduction to Data Compression . Morgan Kaufmann, 2000, ISBN 978-1-55860-558-9 , Chapter 4: Arithmetic Coding .
↑ Strutz: Image data compression. SpringerVieweg, 2009
↑ Guy E. Blelloch, Introduction to Data Compression (PDF; 408 kB), p. 18, 2010
↑ Amir Said, Introduction to Arithmetic Coding - Theory and Practice (PDF; 462 kB), p. 13
↑ E. Bodden, et al., Elaboration on the basics of arithmetic coding, including source text (PDF; 581 kB), 2002, p. 41

[riss76-1] Jorma Rissanen: Generalized Kraft inequality and arithmetic coding. Ed .: IBM Journal of Research and Development. No. 20 . IBM (company publication), 1976, p. 198-203 .

[Khal1-2] Khalid Sayood: Introduction to Data Compression . Morgan Kaufmann, 2000, ISBN 978-1-55860-558-9 , Chapter 4: Arithmetic Coding .

[Strutz-3] Strutz: Image data compression. SpringerVieweg, 2009

[Blelloch10-4] Guy E. Blelloch, Introduction to Data Compression (PDF; 408 kB), p. 18, 2010

[Amir04-5] Amir Said, Introduction to Arithmetic Coding - Theory and Practice (PDF; 462 kB), p. 13

[Bodden02-6] E. Bodden, et al., Elaboration on the basics of arithmetic coding, including source text (PDF; 581 kB), 2002, p. 41