Tunstall coding

The Tunstall coding is a form of lossless data compression and entropy coding , which in 1967 by Brian Parker Tunstall in his doctoral thesis at Georgia Institute of Technology has been developed. In contrast to similar methods such as Huffman coding , Tunstall coding assigns a code symbol with a fixed number of bits (positions) to a source symbol with a variable length .

Procedure

Search tree in the first iteration step

The data coding of the character string "hello_world" should serve as an example. For the sake of simplicity, it is further assumed that the scope of the source symbols, the so-called alphabet, should only include the 8 symbols { dehlorw_ }. In this case, the occurrence probability of each character can be indicated directly in the string to be coded: For example, the sign is 'l' in the 11-character string three times before which an occurrence probability of ³ / ₁₁ corresponds.

For the first iteration step and the structure of the search tree , the individual probabilities of occurrence of all 8 source symbols are determined.

Search tree in the second iteration step

The entropy coding is improved by further iteration steps. In the second iteration step, the sheet is taken with the highest occurrence probability from the tree, in this case, the symbol l with ³ / ₁₁ , and all the probabilities for the symbol l followed formed by one of the 8 other possible symbols. In this case, too, the probabilities of occurrence of the individual symbols or symbol sequences are formed and sorted in descending order in the tree, as shown in the second figure. The probability of occurrence of the symbol sequence ll in this example is:

{\ displaystyle \ left ({3 \ over 11} \ right) ^ {2} = {9 \ over 121}}

This results in a total of 15 different code words which can be coded with bits. The notation stands for the so-called Gauss brackets . For example, as shown in the figure, the symbol 'h' is assigned the code word w ₁ = "0000". The process stops when the number of code words has reached or exceeded an initially defined limit. This results in the following figure: ${\ displaystyle \ lceil \ log _ {2} (15) \ rceil = 4}$ ${\ displaystyle \ lceil \ cdot \ rceil}$

word	code
w ₁ = "h"	0000
w ₂ = "e"	0001
w ₃ = "lh"	0010
w ₄ = "le"	0011
w ₅ = "ll"	0100
w ₆ = "lo"	0101
w ₇ = "l_"	0110
w ₈ = "lw"	0111
w ₉ = "lr"	1000
w ₁₀ = "ld"	1001
w ₁₁ = "o"	1010
w ₁₂ = "_"	1011
w ₁₃ = "w"	1100
w ₁₄ = "r"	1101
w ₁₅ = "d"	1110
	1111

If the Tunstall coding in this example were to be ended after the second iteration step with a length of 15 code words, the character sequence "hello_world" would represent the following binary code sequence encoded in Tunstall, including the assigned source symbols:

0000 0001 0100 1010 1011 1100 1010 1101 1001
 h    e    ll   o    _    w    o    r    ld

literature

Martin Bossert: Applied Information Theory. (PDF; 815 kB) script for the lecture. (No longer available online.) Ulm University, April 2007, archived from the original ; Retrieved April 5, 2013 .

Individual evidence

^ Brian Parker Tunstall: Synthesis of noiseless compression codes . Georgia Institute of Technology , December 1967.
↑ Variable to fixed Length Source Coding - Tunstall Codes. (PDF; 715 kB) Massachusetts Institute of Technology, October 1994, accessed April 5, 2013 .

[tuns1-1] Brian Parker Tunstall: Synthesis of noiseless compression codes . Georgia Institute of Technology , December 1967.

[mit1-2] Variable to fixed Length Source Coding - Tunstall Codes. (PDF; 715 kB) Massachusetts Institute of Technology, October 1994, accessed April 5, 2013 .