Move to front
Move to front is a coding method that is well suited to further processing data from the Burrows-Wheeler transformation so that it can then be compressed more effectively.
Format of the input and output data
The input data for Move-to-Front are a finite alphabet and a character string from this alphabet. The alphabet can be, for example, ASCII or Unicode , but also bytes .
The output from Move-to-Front is a sequence of natural numbers, where each of the numbers is less than the length of the alphabet.
functionality
Move-to-front coding:
- Write the entire alphabet in a string a .
- For each character z of the input:
- Output the position of z in a .
- Remove z from a and put it back in the front.
This sequence means that characters that occur frequently in the input are relatively far to the front of the alphabet a during encoding . This in turn means that the output contains more small numbers than large ones, which in turn is useful for subsequently compressing the sequence of numbers, for example using Huffman coding .
Move-to-front decoding:
- Write the entire alphabet in a string a .
- For each number z of the input:
- Output the character at position z of a .
- Remove this symbol from a and put it back in front.
The decoding works almost exactly like the coding, except that the position at which the alphabet is changed is already known (the number from the input), while it has to be determined during the coding.
example
The string "Mississippi" is to be encoded with the MTF algorithm. The alphabet that is used is (for brevity) "ABCIMPSabcimps".
In the following table, characters are the input characters and the alphabet is the current alphabet. The output is the position of the character in the current alphabet (starting with 0), and Alphabet ' is the new alphabet created by moving the input character to the beginning.
character | alphabet | output | Alphabet' |
---|---|---|---|
M. | ABCIMPSabcimps | 4th | MABCIPSabcimps |
i | MABCIPSabcimps | 10 | iMABCIPSabcmps |
s | iMABCIPSabcmps | 13 | siMABCIPSabcmp |
s | siMABCIPSabcmp | 0 | siMABCIPSabcmp |
i | siMABCIPSabcmp | 1 | isMABCIPSabcmp |
s | isMABCIPSabcmp | 1 | siMABCIPSabcmp |
s | siMABCIPSabcmp | 0 | siMABCIPSabcmp |
i | siMABCIPSabcmp | 1 | isMABCIPSabcmp |
p | isMABCIPSabcmp | 13 | pisMABCIPSabcm |
p | pisMABCIPSabcm | 0 | pisMABCIPSabcm |
i | pisMABCIPSabcm | 1 | ipsMABCIPSabcm |
The result of the coding is the text (4,10,13,0,1,1,0,1,13,0,1). If you omit the rearrangement of the alphabet, you get the text (4,10,13,13,10,13,13,10,12,12,10) for comparison. You can see that after a short "familiarization phase" (here 3 characters: 4,10,13), small numbers are output relatively often, which is good for subsequent compression.
To decode MTF again, you go the opposite way:
The sequence of numbers (4,10,13,0,1,1,0,1,13,0,1) should be decoded using the alphabet "ABCIMPSabcimps". In the following table, position is the number from the sequence of numbers to be decoded and character is the decoded character. The Alphabet and Alphabet columns are exactly the same as in the coding table above.
position | alphabet | character | Alphabet' |
---|---|---|---|
4th | ABCIMPSabcimps | M. | MABCIPSabcimps |
10 | MABCIPSabcimps | i | iMABCIPSabcmps |
13 | iMABCIPSabcmps | s | siMABCIPSabcmp |
0 | siMABCIPSabcmp | s | siMABCIPSabcmp |
1 | siMABCIPSabcmp | i | isMABCIPSabcmp |
1 | isMABCIPSabcmp | s | siMABCIPSabcmp |
0 | siMABCIPSabcmp | s | siMABCIPSabcmp |
1 | siMABCIPSabcmp | i | isMABCIPSabcmp |
13 | isMABCIPSabcmp | p | pisMABCIPSabcm |
0 | pisMABCIPSabcm | p | pisMABCIPSabcm |
1 | pisMABCIPSabcm | i | ipsMABCIPSabcm |
The output of the decoding is "Mississippi" as expected.