Shannon-Fano coding

from Wikipedia, the free encyclopedia

The Shannon-Fano coding is an entropy coding . Characters are encoded according to their frequency of occurrence so that they have the smallest possible average word length. The entropy - the mean information content per symbol - gives a lower limit ( source coding set ).

Basic idea

It is known from entropy coding that characters have to be coded with a different number of bits if one wants to obtain a memory-saving representation. However, it is not possible to simply assign arbitrary bit sequences to the characters and output them.

If, for example, the character A is coded with the bit sequence “10”, the character B with the sequence “01” and C with “0”, the character string ABC becomes “10010”. However, this sequence is also generated by the string "ACA". It is no longer clearly recognizable which character sequence the bit sequence “10010” stands for.

In order to avoid this ambiguity, the bit sequences assigned to the characters must be prefix-free , i. H. no bit sequence that stands for a single character may be at the beginning of another character. This property is not fulfilled in the example above, since the bit sequence “0”, which stands for C, is at the beginning of the sequence assigned to B.

Prefix-free code

The basic idea is to use a full binary tree to represent the code. In the tree, the leaves represent the characters to be coded, while the path from the root to the leaf determines the bit sequence.

The picture shows a tree that is of the desired shape. The inner nodes are numbered so that they can be named.

In order to determine the bit sequence that should stand for one of the characters, it must be specified how the branches are to be coded. A variant is to say that left subtrees are coded with a 0 and right subtrees with a 1. This results in the following bit sequences for the example tree:

A. B. C. D. E.
00 01 100 101 11

Many other codings are just as possible. Here are just a few more examples:

A. B. C. D. E.
10 11 000 001 01
11 10 010 011 00
11 10 001 000 01

If you want to encode a character string, the bit strings assigned to the corresponding characters are output one after the other. The first encoding of the character sequence “ACDC” becomes the bit sequence “00100101100”.

In order to decode this bit sequence, it is processed bit by bit. The decoder must remember the current position in the tree. It starts at the root, ie in node 1. Then the first bit is read, a 0. That means with this coding to the left, the current node becomes 2. Another 0 bit follows. The current node is now A. A is output and the current node is set to 1 again. The next 1 bit read results in the current node being the 3, and so on.

The problem to be solved now is to create such a tree in which the average number of questions until a character is clearly determined is as small as possible on average, i.e. as close as possible to the entropy.

Shannon-Fano coding and Huffman coding are two different algorithms for constructing these trees. In contrast to the Huffman coding, the Shannon-Fano coding is not always optimal.

Shannon-Fano

The algorithm named after Claude Shannon and Robert Fano works with the following rule:

  • Sort the characters according to their frequency.
  • Divide the characters in this order into two groups so that the sum of the frequencies in the two groups is as equal as possible. The two groups correspond to the left and right subtree of the Shannon tree to be created. The resulting partitioning is not necessarily the best that can be achieved with the given frequencies. The best partitioning is also not sought, as this does not necessarily lead to a better result. What is important is that the mean deviation from the entropy of the characters is as small as possible.
  • If there is more than one character in one of the resulting groups, apply the algorithm recursively to this group.

The important element in this algorithm is the first step. This ensures that characters with similar probabilities often end up in the same subtree. This is important because nodes in the same tree receive bit sequences with similar code lengths (the maximum difference can only be the number of nodes in this tree). If the probabilities are very different, it is not possible to generate bit sequences with the necessary differences in length. This means that rare characters get bit sequences that are too short and frequent characters that are too long bit sequences.

example

Shannon-Fano example

The example shows the construction of the Shannon code for a small alphabet. The five characters to be encoded have the following frequencies:

character A. B. C. D. E.
frequency 15th 7th 6th 6th 5

The values ​​are already sorted, the next step is the partitioning. At the beginning all characters are in a partition (in picture a).

Half the probability sum is 19.5. 15 is 4.5 below this half value, 15 + 7 on the other hand only 2.5 above - there is no better partitioning. The alphabet is divided into 2 parts so that one part contains the characters A and B and the other part contains the rest (C, D and E) (picture b). Both partitions still contain more than 1 character, so they have to be divided further. The partition with A and B can only be divided into 2 parts with one character each. The other group has 2 options.

6 + 6 is further from the center than 6 alone. So it is divided into the two partitions with C and D + E (picture c). Finally, D and E are divided. The resulting decision tree is shown in Figure section d.

The characters A, B and C were assigned 2 bits, D and E 3 bits. The probability of occurrence of A is 15/39, of B 7/39, of C and D 6/39 and of E 5/39. This results in the mean code word length :

Since the assignment is not clear, here is an example of a possible coding based on this tree:

character A. B. C. D. E.
Coding 00 01 10 110 111

See also

Individual evidence

  1. In fact, they only have to be unique, but for each unique code there is also a prefix-free code with an identical code word length