Burrows-Wheeler Transformation

from Wikipedia, the free encyclopedia

The Burrows-Wheeler Transformation ( BWT ) is an algorithm that is used in data compression techniques such as bzip2 , but does not perform any data compression itself. The transformation was developed by Michael Burrows and David Wheeler in 1994 at the DEC Systems Research Center (SRC) in Palo Alto and is based on an unpublished transformation by Wheeler from 1983.

properties

The BWT is an algorithm that uses a block of data (input) to generate a block of data of the same size (output) and a small amount of additional information (an index). The output is a permutation of the input, that is, the character frequency of input and output is the same, only the order can change. The output is generally easier to compress because the same characters appear one behind the other more often than in the input. The input data can be restored from the output data and the index, i.e. the transformation can be reversed.

The Burrows-Wheeler transform depends on a number of parameters:

  • The character set used: In most cases, the BWT is applied to bytes with 8 bits.
  • The sort order of the characters is relatively irrelevant as long as the characters are fully ordered .
  • The index, which is the output of the BWT, can be 1-based or 0-based.

It is important that these parameters are the same for the forward and backward transformations.

Forward transformation

Input:

  • the data to be encoded

Output:

  • the encoded data
  • an index

First, all possible rotations of the data to be encoded are generated by shifting one character from the end of the data to the beginning. All of these rotations are written in a table. This table is sorted lexicographically. The last characters of each line - read from top to bottom - result in the coded text. The index is the number of the line in which the data to be encoded occurs in the table.

principle

The efficiency of the Burrows-Wheeler transformation is based on the fact that in every language certain parts of words / syllables usually only begin with a few letters (e.g. (g) eg-ang, (g) eg-en, ...) . By sorting the table described above, such word components are piled up. Since the last column of the table contains the character in front of it, the characters that are most likely to begin the respective word / text segment always accumulate in the output of the algorithm.

To illustrate this effect, the section from the sorted table of the transformed Wikipedia article "Fog" is shown below (the first character is always the last column of the table, i.e. the output):

D - ie Tröpfchendurchmesser innerhalb eines N
D - ie Unterscheidung von Nebeln in bestimmte
d - ie Verfügbarkeit von Wasserdampf und zum
d - ie die Luft enthalten kann, ohne dass Kon
w - ie die vor allem thermischen Oberflächene
s - ie häufig in Richtung von Siedlungsräumen
D - ie höchste Nebelhäufigkeit zeigt sich dab
w - ie in Auftriebsbereichen der Fall. Die wa
s - ie jedoch auch stark zwischen den einzeln
d - ie maximale Wasserdampfmenge, die die Luf
d - ie ohne bestehende Oberflächen nicht mögl
. - ...
. - ...
. - ...
l - ichen Nebeldichte führen, man spricht von
l - ichen Nebelhäufigkeit erhöht ist bzw. man
l - ichen Skalenbereiche können dabei stark s
l - ichen Ursachen und wird im Abschnitt Nebe
e - ichen der Fall. Die wahrgenommene Nebelhä
l - icher Ursachen den Taupunkt erreicht. Die
e - icher einschätzt. Auch die räumlichen Ska
n - icht allzu häufig geschieht. Wenn Nebel b
n - icht möglich ist. So kann dann auch vor a

Sample code in Lua

function BWT_vorwaerts(text)
  local len = string.len(text)

  -- Tabelle mit allen Rotationen des Textes erzeugen
  local matrix = {}
  for i = 1, len do
    matrix[i] = string.sub(text, i) .. string.sub(text, 1, i - 1)
  end

  -- Tabelle sortieren
  for i = 1, len do
    for j = i + 1, len do
      if matrix[i] > matrix[j] then
        matrix[i], matrix[j] = matrix[j], matrix[i]
      end
    end
  end

  -- Aus jeder Zeile das letzte Zeichen nehmen
  local codiert = ""
  local index = -1
  for i = 1, len do
    codiert = codiert .. string.sub(matrix[i], -1)
    if matrix[i] == text then
      index = i
    end
  end

  return codiert, index
end

Optimization possibilities

It is not necessary to save the entire table because all lines contain the same text, just start at different places. It is therefore sufficient to save the text only once. Each table line then only consists of a pointer to the start of the character string. The larger the data block to be transformed, the more this optimization is worthwhile. For example, if you transform 500 kilobytes and store each table row separately, you need 500,000 rows of 500,000 bytes each, i.e. 250 gigabytes , just to accommodate the table in memory. However, if you use pointers, you only need 500,000 bytes once for the text and 4 bytes per line for the pointer (with 32-bit memory addresses ), so a total of 2.5 megabytes .

example

Input:

  • Data to be encoded: " .ANANAS. "

First all rotations are generated and written to a table:

1: .ANANAS.
2: ..ANANAS
3: S..ANANA
4: AS..ANAN
5: NAS..ANA
6: ANAS..AN
7: NANAS..A
8: ANANAS..

Then this table is sorted alphabetically, the point is sorted at the end.

1: ANANAS..
2: ANAS..AN
3: AS..ANAN
4: NANAS..A
5: NAS..ANA
6: S..ANANA
7: .ANANAS.
8: ..ANANAS

The last characters of each line are written one after the other and result in the output text. The input text appears in the 7th line, therefore the index is 7.

Output:

  • Coded data: " .NNAAA.S "
  • Index: 7

Inverse transformation

Input:

  • coded text
  • an index

Output:

  • uncoded text

There are several methods for the inverse transformation, which are explained below. If you only have the coded text as input, you can determine the order of the characters in the uncoded text. But that still leaves as many possibilities as the text has characters. So that the reversal is clear, you need a number that indicates where in the coded text the uncoded text begins. This number can be calculated from the index.

Original procedure

In order to transform the text back, the text is sorted using a stable sorting process , and when sorting, attention is paid to which character ends up in which position. This gives an assignment between the coded position (in the last line of the coding table) and the sorted position (in the first line of the coding table). This assignment is a permutation with only one cycle , that is, when you start to run through the permutation at a certain position, you reach all elements once and only then end up at this position again. This is exactly what is done with the reverse transformation, because in this pass you pass all characters of the text in the order in which they were arranged before the forward transformation. By starting at the index, you get the characters in exactly the order they were in before the forward transformation.

Sample code in Lua

function BWT_rueckwaerts(text, index)
  local len = string.len(text)

  -- Zeichen mit zugehörigen Positionen in einer Tabelle speichern
  local tabelle = {}
  for i = 1, len do
    tabelle[i] = { position = i, zeichen = string.sub(text, i, i) }
  end

  -- Diese Tabelle nach den Zeichen sortieren. Wichtig ist hier,
  -- ein ''stabiles'' Sortierverfahren einzusetzen.
  for i = 1, len - 1 do
    for j = 1, len - 1 do
      if tabelle[j].zeichen > tabelle[j + 1].zeichen then
        tabelle[j], tabelle[j + 1] = tabelle[j + 1], tabelle[j]
      end
    end
  end

  -- Beim Index beginnend einmal durch die Tabelle
  -- wandern und dabei alle Zeichen aufsammeln.
  local decodiert = ""
  local idx = index
  for i = 1, len do
    decodiert = decodiert .. tabelle[idx].zeichen
    idx = tabelle[idx].position
  end

  return decodiert
end

example

The data (text: " a! IepdWkii ", index: 2) should be transformed back. The sort order is: exclamation mark, upper case, lower case (as in ASCII ).

The data is sorted using a stable sorting process, and when sorting, attention is paid to which character ends up in which position.

coded text a ! i e p d W. k i i
position 1 2 3 4th 5 6th 7th 8th 9 10
sorted text ! W. a d e i i i k p
sorted position 2 7th 1 6th 4th 3 9 10 8th 5

Example: In the coded text, the capital "W" was in position 7, after sorting it is in position 2, together with the information that it comes from position 7. The stable sorting process is important in order not to confuse the order of the i s. In line 2 they are in positions 3, 9 and 10, and in exactly this order they are also in line 3.

The line of coded text is no longer needed. To reverse transform the text, start with the index, i.e. 2. Is one in the sorted text at position 2 W . The position of the next character results from the line sorted position , i.e. 7. There is an i , and it continues at position 9. There is a k , then 8 (i), 10 (p), 5 (e), 4 (d), 6 (i), 3 (a), 1 (!), 2 (the index). The original text was “ Wikipedia! ".

More detailed example

The text " .NNAAA.S " is to be transformed back. In the sorted table it was in line 7.

The complete sorted matrix can be reconstructed step by step from this data, and when that is done, the original text can be found in line 7.

The first column of the matrix can be easily reconstructed because it contains all characters of the text, just sorted:

 1: A______.
 2: A______N
 3: A______N
 4: N______A
 5: N______A
 6: S______A
 7: .______.
 8: .______S

If you now consider that the same text is in all lines, only rotating, you can gradually complete the matrix. For a better overview you can write the last column again before the first column.

    8│12345678
 ────┼────────
 1: .│A______.
 2: N│A______N
 3: N│A______N
 4: A│N______A
 5: A│N______A
 6: A│S______A
 7: .│.?_____.
 8: S│.?_____S

You can now see that a "." either an "A" follows (line 1) or another "." (Line 7). These two characters must be in line 7 and 8 at position 2 (marked with a question mark). Since, as already said, the matrix is ​​sorted and the first character in these lines is the same, the "A" must be in line 7 and the point in line 8. You do the same with the other characters: Look for all subsequent characters, sorted them, and enter them from top to bottom in the lines that end with this symbol. This gives column 2.

  • "A" is followed by "N", "N" and "S", which come in lines 1, 2 and 3.
  • "N" is followed by "A" and "A", which come in lines 4 and 5.
  • "S" is followed by ".", Which comes in line 6.
  • On "." followed by "A" and ".", which come in lines 7 and 8.
    8│12345678
 ────┼────────
 1: .│AN_____.
 2: N│AN_____N
 3: N│AS_____N
 4: A│NA_____A
 5: A│NA_____A
 6: A│S._____A
 7: .│.A_____.
 8: S│.._____S

Unfortunately, this procedure no longer works for the other columns. (Example: The successors of "A" are "N", "N" and "S", and if you enter them from top to bottom in the lines that end with "A", line 7 suddenly says " .AS ____. ", And that can't be true.) But if you look closely you can see that the character sequence" S .. "occurs somewhere in the word (line 8). And there is only one way to continue this string, namely with "..A" (line 7). Then comes ".AN" (line 1), and now there is the next problem: either line 4 or line 5 could come, because both contain the characters "ANA". But there is also a solution for this problem.

If you remember when reconstructing column 2 from which line the characters come and in which they are inserted, you get the following table:

Column 2 in line ... 1 2 3 4th 5 6th 7th 8th
... comes from line ... 4th 5 6th 2 3 8th 1 7th

This table fits surprisingly well with the successors that can be determined by looking closely. Because the successor to line 8 is line 7, the successor to 7 is 1, and that's the same in the table. The table also gives the answer to the ambiguity problem. The correct order of the lines can now be read from the table by starting with 7 (this is the line number in which the original text was) and writing down the number from the bottom line. Then you look for that number in the top line and so on. This gives the order 1, 4, 2, 5, 3, 6, 8, 7.

In the last step you look for each of these numbers to see what is in the last column of the corresponding line. In line 1 this is a ".", In line 4 an "A", in line 2 an "N", and if you put all these characters together, you get the text " .ANANAS. ".

Alternative method

This method is more computationally expensive than the original method and is therefore more suitable for demonstration than for implementation in computer programs. It is based on the idea of ​​reconstructing all other columns step by step, starting from the coded text that was in the last column of the forward transformation table. When this goal has been achieved , the back-coded text can be read in the line that the index specifies.

  1. Do the following until the table is full:
    1. The coded text is written character by character vertically in the last column of the table.
    2. The table is then rotated one position to the right so that the last column becomes the first column.
    3. Then the table is sorted.
  2. The text in the Index line is the searched, back-coded text.

Explanation

The table that is used in the forward transformation has an important property that is exploited in this backward transformation: The characters in the last column (and every other) appear with the same frequency in the first column of the table, just sorted. This means that if the first column is not known, but any other column, you can construct the first column from it.

After filling in the last column (step 1.1), the part of the table that has already been filled in corresponds to the part that was also used for the forward transformation. By rotating and then sorting (steps 1.2 and 1.3), the front columns of the table are correctly filled, since the table was sorted in the same way for the forward transformation. By rotating (step 1.2) the last column becomes free again so that it can be filled again with data that match the previous columns that have already been filled in.

Sample code in Lua

function BWT_rueckwaerts(text, index)
  local len = string.len(text)

  -- Am Anfang ist die Tabelle leer
  local tabelle = {}
  for i = 1, len do
    tabelle[i] = ""
  end

  for n = 1, len do
    -- Codierten Text zeichenweise vor die erste Spalte setzen
    for i = 1, len do
      tabelle[i] = string.sub(text, i, i) .. tabelle[i]
    end

    -- Tabelle sortieren
    for i = 1, len do
      for j = i + 1, len do
        if tabelle[i] > tabelle[j] then
          tabelle[i], tabelle[j] = tabelle[j], tabelle[i]
        end
      end
    end
  end

  return tabelle[index]
end

See also

  • Move to front , a coding method that is often used after the Burrows-Wheeler transformation.

Web links

Individual evidence

  1. M. Burrows, D. Wheeler: A block sorting lossless data compression algorithm. Technical Report 124. Digital Equipment Corporation, 1994