Lempel-Ziv-Welch algorithm

from Wikipedia, the free encyclopedia

The Lempel-Ziv-Welch (short LZW algorithm or LZW is called) is a common in graphic formats for data compression , so to reduce the amount of data , inserted algorithm . Much of the functionality of this algorithm was developed and published in 1978 by Abraham Lempel and Jacob Ziv ( LZ78 ). Some detail improvements were made in 1983 by Terry A. Welch .

LZW is a lossless compression method . For example, it is used in the GIF image format developed by CompuServe employees in 1987 and optionally in TIFF . However, it is suitable for any form of data, as the dictionary used is only generated at runtime and is therefore independent of the format. LZW is probably the best-known representative of the LZ family.

functionality

LZW is compressed using a dynamic dictionary , in which the most frequently occurring character strings , such as B. accumulate “is”, “the” and “a” and then only need to be addressed under one abbreviation. The advantage of this algorithm is that no additional dictionary needs to be stored and that the dictionary adapts itself dynamically to the respective content. The decoder is able to reconstruct it from the data stream. Entries in the dictionary are usually addressed via a 12- bit long index . A maximum of 2 12 = 4096 entries are therefore possible. The entries with the index 0 to 255 are filled with the corresponding bytes , i.e. entry 0 with 00 hex , entry 2 with 02 hex ,…, entry 255 with FF hex ( hexadecimal system ). Subsequent entries that are inserted at runtime must therefore start with index 256. New entries are generated by saving the entry found plus the next character. If the character string found is only one character long, usually only this character is saved, since a reference to the corresponding element occupies 12 bits, but the character itself only occupies 8 bits. The distinction between whether a reference or a symbol appears in the bit stream can be set using a flag .

compression

Compression algorithm

The algorithm will initially return a 9-bit code, later this code can be up to 12 bits wide, provided the alphabet is not cleared beforehand by sending a clear code.

The lowest 256 values ​​of the coding alphabet are predefined and correspond to themselves when returned. The algorithm now searches for the longest existing pattern from the codes in the coding alphabet at the input and returns the corresponding value. At the beginning that would be just a byte, which is output as the ninth bit by a 9-bit code with 0. He then chains the next character of the input to this pattern and inserts the result as the next higher entry in the alphabet. And so it goes on all the time until the alphabet fills up. The alphabet is kept internally in the compressor, but not explicitly saved. The decompressor also builds it up from the input. He can reconstruct it. There is also the K [Omega] K case, in which the decompressor is not yet familiar with the pattern from the alphabet. But he can reconstruct the value.

To save a table with 4096 patterns, each of which is up to 4096 characters long, you would generally need 16  MiB . However, each pattern of length n in the table begins with a partial pattern of length n-1 , which is also located in the table. This means that the entire table can be stored in two fields, Prefix and Suffix . In this case contains the last character of the pattern k and the index of the start pattern (ie a reference to another entry in the table). If the pattern has a length of one, <empty pattern> is set to a constant . An entry in the table is shown in the algorithm as a pair of patterns = (prefix, suffix) . The algorithm then works as follows.

     initialisiere Mustertabelle mit (<leeres Muster>+zeichen) für alle Zeichen
     muster := <leeres Muster>
     solange noch Zeichen verfügbar
           zeichen := lies nächstes Zeichen
           wenn (muster+zeichen) in Mustertabelle dann
                 muster := (muster+zeichen)
           sonst
                 füge (muster+zeichen) zur Mustertabelle hinzu
                 Ausgabe muster
                 muster := zeichen
     wenn muster nicht <leeres Muster> dann
           Ausgabe muster

The variable pattern contains the index of the corresponding pattern in the table and output pattern means that the index of the current pattern is written to the output file. With the instruction pattern: = character , pattern is set to the index of the entry (<empty pattern> + character) . Since the pattern table was initialized with these patterns, this index corresponds exactly to the character.

Example of compression

An example with the string "LZWLZ78LZ77LZCLZMWLZAP"

String found entry output new entry
LZWLZ78LZ77LZCLZMWLZAP L. L. LZ (becomes <256>)
ZWLZ78LZ77LZCLZMWLZAP Z Z ZW (becomes <257>)
WLZ78LZ77LZCLZMWLZAP W. W. WL (becomes <258>)
LZ78LZ77LZCLZMWLZAP LZ (= <256>) <256> LZ7 (becomes <259>)
78LZ77LZCLZMWLZAP 7th 7th 78 (becomes <260>)
8LZ77LZCLZMWLZAP 8th 8th 8L (becomes <261>)
LZ77LZCLZMWLZAP LZ7 (= <259>) <259> LZ77 (becomes <262>)
7LZCLZMWLZAP 7th 7th 7L (becomes <263>)
LZCLZMWLZAP LZ (= <256>) <256> LZC (becomes <264>)
CLZMWLZAP C. C. CL (becomes <265>)
LZMWLZAP LZ (= <256>) <256> LZM (becomes <266>)
MWLZAP M. M. MW (becomes <267>)
WLZAP WL (= <258>) <258> WLZ (becomes <268>)
ZAP Z Z ZA (becomes <269>)
AP A. A. AP (becomes <270>)
P P P -

The result is the character string "LZW <256> 7 8 <259> 7 <256> C <256> M <258> ZAP" ("output" read from top to bottom), which corresponds to 16 12-bit characters ( 24 8-bit characters) contains the same information instead of the original 22 8-bit characters.

decompression

Decompression algorithm

Exactly the same pattern table can be generated from the code words one after the other for decompression, since during compression only the old pattern and not the new pattern with the next character was output. When compressed, each pattern starts with the last character of the previous pattern added to the table. Conversely, when decompressing, the last character of the pattern that must be added to the table is the same as the first character of the current pattern that is to be output.

It becomes problematic if the pattern to be output is not yet entered in the table. Then you cannot search for the first character of this pattern in the table. But that only happens if a pattern occurs several times in direct succession. Then the following applies: The new pattern is the previous pattern + first character of the previous pattern.

     INITIALISIERE Mustertabelle MIT (<leeres Muster>,Zeichen) FÜR ALLE Zeichen
     last := lies_ersten_Code()
     Ausgabe(Muster VON last)
     SOLANGE NOCH Codes_verfügbar() WIEDERHOLE:
        next := lies_nächsten_Code()
        WENN next IN Mustertabelle DANN:
           FÜGE ( (Muster VON last), erstes_Zeichen_von(Muster VON next)) ZUR Mustertabelle HINZU
        SONST:
           FÜGE ( (Muster VON last), erstes_Zeichen_von(Muster VON last)) ZUR Mustertabelle HINZU
        Ausgabe(Muster VON next)
        last := next

Decompression example

The characters are read in one after the other. A character and the preceding character or dictionary entry result in a new entry in the dictionary.

current character first letter new entry output
L. - - L.
Z Z LZ (= 256) Z
W. W. ZW (= 257) W.
<256> L. WL (= 258) LZ
7th 7th LZ7 (= 259) 7th
8th 8th 78 (= 260) 8th
<259> L. 8L (= 261) LZ7
7th 7th LZ77 (= 262) 7th
<256> L. 7L (= 263) LZ
C. C. LZC (= 264) C.
<256> L. CL (= 265) LZ
M. M. LZM (= 266) M.
<258> W. MW (= 267) WL
Z Z WLZ (= 268) Z
A. A. ZA (= 269) A.
P P AP (= 270) P

"Output" read from top to bottom results in the previously encoded string "LZWLZ78LZ77LZCLZMWLZAP".

variants

The LZ78 - algorithm works similarly, but starts with an empty dictionary.

LZC is just a slight modification of LZW. The index size and thus the dictionary size is variable, starts at 9 bits and can grow to a size specified by the user. Up to 7% better compression can be expected.

LZMW (by Victor S. Miller , Mark N. Wegman 1985) differs in that instead of just appending one character to a character string in the dictionary, each character string with the longest known string that can be found in the subsequent input immediately afterwards , can be appended. This is quite practical for special data (e.g. a file that consists of 10,000 "a" s), but LZW gets along better with general data.

Patents

Various patents have been issued in the United States and other countries for LZW and similar algorithms . LZ78 was the submitted on August 10, 1981, granted on August 7, 1984 U. S. Patent 4,464,650 of the Sperry Corporation (later Unisys merged) covered, as in the Lempel, Ziv, Cohn and Eastman inventors listed.

Two U.S. patents have been issued for the LZW algorithm: No. 4,814,746 by Victor S. Miller and Mark N. Wegman for IBM , filed June 1, 1983, and No. 4,558,302 by Welch for Sperry Corporation, later Unisys Corporation, filed June 20, 1983.

U.S. Patent 4,558,302 caused the greatest controversy. One of the most widespread uses for LZW was the GIF format for images, which became increasingly popular for websites in the 1990s . Unisys had already been demanding license fees for LZW use in hardware and hardware-related software since 1987, but allowed royalty-free use of the LZW algorithm, while GIF developed into a standard format alongside JFIF . In December 1994, however, Unisys began charging CompuServe license fees from developers of commercial software that could read and write the GIF format , and in 1999 expanded this to include free software. This exploitation as a software patent aroused outrage in developer and user circles around the world and motivated the rapid development of the more powerful graphic file format PNG, which is based exclusively on freely available code .

Many legal experts have concluded that the patent does not cover devices that decompress LZW data, but cannot compress it. For this reason, the widely used gzip program can read file archives in Z format, but not write them.

U.S. Patent 4,558,302 expired on June 20, 2003 after 20 years. The corresponding European, Canadian and Japanese patents followed in June 2004.

Web links

Individual evidence

  1. a b Patent US4558302 : High speed data compression and decompression apparatus and method. Published December 10, 1985 , Applicant: Sperry Corporation, Inventor: Terry Welch.
  2. Patent US4464650 : Apparatus and method for compressing data signals and restoring the compressed data signals. Published August 7, 1984 , Applicant: Sperry Corporation, Inventor: Lempel, Ziv, Cohn and Eastman.
  3. Patent US4814746 : Data compression method. Published March 21, 1989 , Applicant: IBM, Inventor: Victor S. Miller, Mark N. Wegman.