The entropy coding is a method for lossless data compression , the each individual characters of a text assigns a different long sequence of bits. Typical representatives are the Huffman coding and the arithmetic coding .
Since a certain minimum number of bits is necessary in order to distinguish all characters from one another, the number of bits assigned to the characters cannot be infinitely small. The optimal number of bits that should be assigned to a character with a given probability is determined by entropy .
Entropy encoders are often combined with other encoders. Upstream processes serve to reduce the entropy of the data. Often these are prediction methods , methods such as the Burrows-Wheeler transformation , but often other compressors as well. LHarc, for example, uses an LZ encoder and passes the characters output by this encoder to a Huffman encoder. Also Deflate and Bzip have an entropy encoder as the last stage.
The entropy coding initially determines the frequency of symbols (characters). Each symbol is represented by a specific bit pattern. Frequent symbols should be represented by short bit patterns in order to minimize the total number of bits required.
Mathematical algorithms for estimating the optimal length of the code of each symbol usually lead to non-integer results. When used, the lengths must be rounded to whole numbers, which means that part of the compression is lost.
With whole bits
The Shannon-Fano coding suggests a way of rounding the number of bits to whole numbers. In certain cases, however, this algorithm does not provide the optimal solution. That is why the Huffman code was developed, which can be proven to always deliver one of the optimal solutions with whole bits. Both algorithms generate a prefix-free code of variable length by constructing a binary tree . In this tree, the “leaves” stand for the symbols to be coded and the inner nodes for the bits to be stored.
In addition to these methods, which create individual tables specially adapted to the data to be coded, there are also variants that use fixed code tables. The Golomb code can be used for example in data where the frequency distribution is subject to certain rules. These codes have parameters to adapt them to the exact parameters of the distribution of frequencies.
However, these methods only meet the number of bits prescribed by entropy in exceptional cases . This leads to a compression that is not optimal.
An example: A string with only 2 different symbols is to be compressed. One sign has a probability of , the other of . The methods of Shannon and Huffman mean that both characters are stored with one bit each. This results in an output that contains as many bits as the input of characters.
An optimal entropy coder would only use bits for the character A and instead bits for B. This leads to an output that only contains bits per character ( maximum entropy ).
An arithmetic coder can be used to get closer to the optimal coding. This method implicitly ensures a more even distribution of the bits among the characters to be coded without explicitly constructing an individual code character for each character. But even with this method, the maximum entropy cannot generally be achieved. This is because there is still a “scrap” based on the fact that only integer bit values can actually occur, while the entropy usually requires fractions of bits. In the above example, the arithmetic coder only achieves a code length of one bit. However, the scrap generally disappears as the length of the data to be encoded increases, so that the entropy can be maximized in the limit value.
In order to be able to determine the number of bits for each character, one has to give as precise information as possible about the probability of the individual characters at every point in the coding process. This is the task of the model. The better the model predicts the probabilities, the better the compression. The model must deliver exactly the same values when compressing and decompressing. Various models have been developed over time.
In the static model, statistics are created for the entire sequence before the character sequence is encoded. The probabilities obtained in this way are used to encode the entire character string.
- Coding tables only need to be created once. The procedure is therefore very quick.
- Without the statistics required for decompression, the output is guaranteed not to be larger than the original text.
- The statistics have to be sent to the decoder (either as statistics or in the form of the Huffman or Shannon-Fano codes), which costs memory.
- The statistics refer to the entire character string; H. the values do not adapt to local conditions in the character string.
In this model, the probabilities change over the course of the coding process. There are several options:
- Forward dynamic
- The probabilities relate to characters that have already been coded, that is, after a character has been coded, the probability of this character is increased here.
- Dynamic backwards
- Before coding, the number of times each character appears is counted. Exact probabilities can be determined from these counters. During the coding process, the associated counter is decreased by 1 for each coded character. Since towards the end the counters for all characters tend towards 0, the probabilities for characters that no longer occur also become 0. These characters can no longer be coded. Other characters can be coded with fewer bits because their probability increases. In this way, the coding efficiency increases so much that the last character can be coded with 0 bits.
- Adaptation of the model to local conditions
- Statistics information does not have to be transferred in the forward dynamic model.
- Tables must be revised after each character. That costs computing time.
- The statistics hurry after the real facts. In the worst case scenario, the statistic will never match the true probabilities, resulting in more bits being needed.
Normally one does not work with probabilities in dynamic models, but with the frequencies of the characters.
Dynamic models also allow other possible variations.
Statistics data can be aged by halving the frequency of the characters from time to time. This reduces the influence of characters that are far back.
There are several variants for characters that have never appeared before:
- A frequency of at least 1 is assumed for each character so that all characters can be encoded.
- You add a new character to the alphabet. This sign indicates leaving the context. After these characters have been encoded, all characters of the alphabet can be written or read with a fixed code. The character is usually called an escape character. Some of the best compression algorithms, those in the PPM family , use this approach.
Level of the model
The level of a model relates to how many characters of history are used by the model to calculate the probabilities. A level 0 model does not consider history and gives the probabilities globally. A level 1 model, on the other hand, considers the preceding sign and makes its statement about the probability based on this sign. (If German text is to be coded, for example, the probability of the letter “U” after a “Q” is much higher, or the probability of a capital letter in the middle of a word is much lower than after a space). The level can theoretically be as high as you want, but it then requires enormous storage space and large amounts of data to obtain meaningful statistics.
PPM algorithms use an arithmetic encoder with a level 5 dynamic model.
- Mark Nelson, Jean-Loup Gailly: The Data Compression Book , second edition. M & T Books, New York 1996, ISBN 1-55851-434-1 .
- This book is a relatively good introduction. However, it deals more with the implementation in C than with the theory of data compression.
- Khalid Sayood: Introduction to Data Compression , second edition. Morgan Kaufmann Publishers, San Francisco 2005, ISBN 1-55860-558-4 .
- This book is a standard work on the theoretical and application-relevant basics of data compression.
- Stefan Weinzierl (Ed.): Handbook of audio technology. Springer Verlag, Berlin 2008, ISBN 978-3-540-34300-4 .
- Tilo Strutz: Image data compression. Basics, coding, wavelets, JPEG, MPEG, H.264. 4th revised and supplemented edition, Vieweg + Teubner, Wiesbaden 2009, ISBN 978-3-8348-0472-3 .