Run length coding

from Wikipedia, the free encyclopedia

The run-length coding ( English run-length encoding , RLE for short ), also the run-length coding , is a simple lossless compression algorithm and belongs to the group of entropy coders . It is suitable for compressing longer repetitions of symbols.

idea

The basic idea of ​​the algorithm is to replace every sequence of identical symbols with their number and, if necessary, the symbol. This means that only the places where the symbol changes in the message are marked. Since the length specification only grows logarithmically compared to the length of the sequence , you save a considerable amount of memory space, especially with long repetitive sequences . Conversely, the shorter the repetitions, the lower the savings.

Run-length coding is nowadays used as a precoding step (e.g. in image compression , such as JPEG ) in order to save effort in the following coding step. (E.g. Huffman coding saves looking at longer symbols, since these have already been reduced.)

Example:

Instead of a sequence with five repetitions of the character 0 and three times 1

0000 0111

you just save

5 3

The longer a single episode is, the greater the savings, because

  • for 10 repetitions you need two decimal places,
  • for 100 repetitions you need three decimal places,
  • for 1000 repetitions you need four decimal places, etc.

The same applies to any other number systems .

Procedure

Bit strings

There are only two options for coding bit sequences: a sequence of zeros or a sequence of ones. Every sequence of zeros is guaranteed to be followed by at least one one - and vice versa. The only exception is when the end of the message is reached.

The encoder now agrees with the decoder which bit to start with. This can either be by convention or, for example, by an additional bit at the beginning. The lengths of the zero and one sequences are then transmitted alternately. The decoder then doesn't have to do anything other than output a corresponding number of zero or one bits for each received value.

Example:

The starting sequence is:

1111 1110 0000 1000 0001 1111

This is used to code:

7 5 1 6 5

The minimum number of necessary binary digits is three, because this covers the number range from 0 to 7. The compressed sequence is then binary-coded

111 101 001 110 101

The original 24 bits could be reduced to 15 bits.

Multi-valued symbol sequences

When compressing symbol sequences that consist of an alphabet with more than two symbols, it is no longer clear which symbol follows next (e.g. bytes have an alphabet of 256 possible characters). In addition to the number of repetitions, the symbol that makes up the sequence must also be sent.

A special feature here is that the compressed message u. It may even become larger if the memory space for the number of repetitions is greater than the sequence itself.

Example:

The starting sequence is:

AAAA ABBB BBBB CDDD EE

This is used to code:

{A, 5}, {B, 7}, {C, 1}, {D, 3}, {E, 2}

In principle, a symbol could consist of two letters instead of just one:

{AA, 2}, {AB, 1}, {BB, 3}, {CD, 1}, {DD, 1}, {EE, 1}

In the worst case (no repetitions) the “compressed” message would be larger than the original. From the episode

ABCD

would

A1B1C1D1.

implementation

The basic algorithm (without optimizations) is easy to implement:

#include <stdio.h>

int main()
{
   int n = 1; /* Anzahl der Wiederholungen */
   int ch = -1; /* Aktuelles Zeichen */
   int prev_ch = getchar(); /* Vorheriges Zeichen */

   do {
      ch = getchar();

      if ((ch != prev_ch) || (n == 255)) /* Symbol/Wiederholungen-Tupel ausgeben, falls ein anderes Symbol kommt oder die maximal darstellbare Anzahl erreicht. */
      {
         /* printf("%c%c", prev_ch, n); */ /* Binäre Ausgabe */
         printf("%c, %d\n", prev_ch, n); /* Lesbare Ausgabe als 2er-Tupel */

         n = 0; /* Beginn einer neuen Folge */
      }

      prev_ch = ch;
      ++n;
   } while (ch != EOF);

   return 0;
}

Modifications

Sometimes there are very few repetitions in a message. In order to prevent every single occurrence in a multi-valued alphabet from being replaced by a tuple with length 1 (e.g.  ABCA1B1C1), only sequences of a certain length (e.g. four or more ) are coded.

But then you need a special character ( escape character ), which indicates that a compressed tuple follows. In the ideal case, this special character or symbol does not otherwise appear in the message, otherwise one chooses a symbol which one assumes that it rarely occurs. The peculiarity of this symbol is that it has to be coded as a tuple every time (even if it only occurs once), otherwise it is not possible to differentiate between the symbol and the tuple.

Example :

The original message is:

Auus die Maaaaauuuuus (Length: 21 characters)

We choose the character “s” as the escape character (for clarity). In addition, we only code sequences that contain at least three repetitions:

Auuss1 die Msa5su5ss1 (Length: 21 characters)

If a letter is repeated more than three times or if it is the escape character, the output of the escape character indicates that a tuple with length specification follows. This is followed by the number of repetitions and finally the corresponding symbol.

The escape character does indeed result in additional memory requirements, but in the present case this is offset by the saving in length 1 sequences. The message would be naively coded as follows:

1A2u1s_1d1i1e_1M5a5u1s (Length: 22 characters)

In the PCX file format , a different escape character is used depending on the number of repetitions (these make up a quarter of the character set in PCX), so that the escape and length characters coincide to form one character.

Applications

Run-length coding is used in combination with a modified Huffman coding for fax transmission according to ITU-T recommendation T.30 ("G3 fax"). When transferring black and white pages, run length coding achieves good results, since long white areas alternate with shorter black areas.

In the lossy compression of images, the run length coding is applied to the individual coefficients after the transformation into the frequency domain. After the quantization, in particular, there are usually many identical values ​​or zeros that can be effectively compressed with run-length coding. Then this “pre - compressed ” data is further compressed with the Huffman coding .

File formats

Well-known file formats that use run length encoding are some older graphic formats such as Windows Bitmap , GEM Image , Targa or PCX . Under Microsoft Windows , the file extension * .rle is usually used for RLE-compressed images .

literature

  • David Salomon: Data Compression: The Complete Reference . 4th edition. Springer, 2007, ISBN 978-1-84628-602-5 .

Individual evidence

  1. T.30: Procedures for document facsimile transmission in the general switched telephone network. ITU-T, accessed August 15, 2013 .