Area coding

from Wikipedia, the free encyclopedia

The range coding (engl. Range encoding ) is a data compression method for entropy coding , one type of arithmetic coding realized.

Range coding is often viewed as an alternate description of arithmetic coding and essentially identical to it. It is based on the 1979 document "Range encoding: an algorithm for removing redundancy from a digitized message" (English, in German about range coding, an algorithm to remove redundancy from digitized messages ), by G. N. N. Martin. Due to the age of the document, it is assumed that implementations of the arithmetic coding method described therein are not affected by the patents on arithmetic coding. This has sparked interest in the technology , especially in the free software community.

functionality

In principle, area coding encodes all symbols in a message in a number - in contrast to Huffman coding , which assigns a bit pattern to each symbol and puts the bit patterns back in sequence. Therefore, the area coding does not have to use the upper limit of having to use at least one bit for a symbol like this and also does not suffer like this from the inefficiency in dealing with occurrence probabilities that are not exactly a multiple of two. In this way, better compression rates can be achieved.

The central concept behind the range coding is as follows: If you have a sufficient range of integer values as symbols and an evaluation of the probability of occurrence of the symbols, the original range can easily be divided into sub-areas, the sizes of which are proportional to the probability of occurrence of the Are symbols that they represent. This means that each symbol of the message can be encoded by reducing the value range to the sub-range that corresponds to the next symbol to be encoded. The decoder must have the same probability assumption as the encoder, which can either be transmitted beforehand, obtained from data already transmitted or be part of the encoder and decoder.

When all symbols have been coded, it is sufficient for the message to be transmitted to display the sub-area, provided of course that the decoder knows when the message is again complete. A single integer is sufficient to indicate the sub-range, and it does not even have to be necessary to transmit the entire integer; the first of the digits may be sufficient to uniquely describe the original message.

example

The message "AABA <EOM>" is to be encoded, where <EOM> indicates the end of the message . For this example it is assumed that the decoder knows that a decimal system , a range of [0, 100000) and the frequency rating {A: 0.60; B: 0.20; <EOM>: 0.20} is used. The first symbol divides the area [0, 100000) into three sub-areas:

A:     [     0,  60000)
B:     [ 60000,  80000)
<EOM>: [ 80000, 100000)

Since the first symbol is an A, this reduces the original range to [0, 60000). The second symbol decision divides this sub-area into three sub-areas, shown below after the already coded 'A':

AA:     [     0,  36000)
AB:     [ 36000,  48000)
A<EOM>: [ 48000,  60000)

After two symbols have already been coded, the area remains [000000, 036000) and the third symbol decides between the following areas:

AAA:     [     0,  21600)
AAB:     [ 21600,  28800)
AA<EOM>: [ 28800,  36000)

This time it is the second of the three possibilities that describes the desired message and the range is limited to [21600, 28800). In this case, it may seem more difficult to identify the sub-ranges, but they result quite simply: By subtracting the lower limit value from the upper limit, the result is that the range comprises 7200 numbers, which are divided according to the probability pattern into 4320 for the first, 60- Share and 1400 each for the next two, 20 shares of the total (sub) range. Adding back to the lower limit of the total (partial) range results in the following sub-ranges:

AABA:     [21600, 25920)
AABB:     [25920, 27360)
AAB<EOM>: [27360, 28800)

In order to code the last symbol, the range is now limited to [21600, 25920). According to the method just described, this is now divided between the limit values ​​into the following sub-areas:

AABAA:     [21600, 24192)
AABAB:     [24192, 25056)
AABA<EOM>: [25056, 25920)

The last symbol <EOM> defines the final area to [25056, 25920). Since all numbers that begin with "251" fall into this range, the message is clearly described by this. (Since eight such unique numbers actually exist, the coding is still not optimal; in that the decimal system was used instead of, for example, the binary system .)

It may now appear as the main problem to choose a sufficiently large area at the beginning to encode all symbols without leaving the integers or getting zero during the division. In practice, however, this problem does not arise because the coder can gradually increase the number and can already output the first digits that are already fixed and no longer need them for the calculations. A small number range is used at all times by outputting fixed numbers and adding some on the other side. In the example, after the first three symbols had been processed, "2" was the first digit of the result and should not have been included in the calculation.

See also

credentials

  1. http://www.compressconsult.com/rangecoder/#download

Web links