LZ77
LZ77 (Lempel-Ziv 77) is a lossless method for data compression that was published in 1977 by Abraham Lempel and Jacob Ziv . It is a dictionary-based procedure which for the first time makes use of the fact that entire sequences of data occur several times in a data set. In methods developed earlier (e.g. Huffman coding or Shannon-Fano coding ), only the frequency of individual characters was used (see also entropy coding ). LZ77 is also called LZ1 as the first LZ procedure .
definition
The LZ77 factorization of a character string is a decomposition into non-empty character strings , so
- and
- for each with :
- is a new letter that does not appear in or
- is the longest string that appears at least twice in .
The individual strings are known as factors or phrases .
From the definition it can be derived directly that is. Here referred to the string , from the position of all characters from up to and including position contains. is the abbreviation for the character string .
algorithm
The idea of the algorithm for calculating the LZ77 compression is to use the already processed prefix of the character string as a dictionary. For implementation reasons, the size of this dictionary is limited in practice in order to limit the search time. It is therefore (Engl. Usually a sliding window sliding window ) is used, which limited both the dictionary and the considered excerpt (Preview buffer). Compressed characters are gradually moved from the preview buffer to the dictionary. In practice, the buffer for the dictionary contains several thousand characters and the preview buffer for the section of the character string to be encoded contains around 100 characters (sometimes even less).
The algorithm produces a sequence of triples as output , with which the original text can be recovered. A triple for a factor has the form , where
- specifies the position of the previous occurrence of in (or , if there is no previous occurrence in ),
- is the length of the previous occurrence of (or 0 if is a new character)
- and the Mismatch letter is, d. H. for is .
It should be noted that when using a buffer for the dictionary, the position of a character string is to be understood relative to the left or right edge of the buffer (this can vary depending on the implementation), new characters must then be introduced. Due to the output format of the compression, the reconstruction of the text is possible without explicitly saving the dictionary.
Pseudocode
The pseudocode is a rendering of the LZ77 compression algorithm with a sliding window:
while Der Vorschaupuffer nicht leer ist; do
Durchsuche rückwärts das Wörterbuch nach der längsten übereinstimmenden
Zeichenkette mit dem Vorschaupuffer;
if Eine Übereinstimmung wurde gefunden; then
Gib das Tripel (Versatz zum Rand des Wörterbuch,
Länge der gefundenen Zeichenkette,
erstes nicht übereinstimmendes Zeichen aus dem Vorschaupuffer) aus;
Verschiebe das Fenster um die Länge+1;
else
Gib das Tripel (0, 0, erstes Zeichen im Vorschaupuffer) aus;
Verschiebe das Fenster um 1;
end if
end while
example
The calculation of the LZ77 factorization is illustrated using the character string aacaacabcabaaac
.
The table shows the calculation of the LZ77 factorization using a dictionary of length 12 and a preview buffer of length 10 (index from 0 to 9). In the rightmost column, read from top to bottom, there is the output of the algorithm (0, 0, a ), (1, 1, c ), (3, 4, b ), (3, 3, a ) and (12 , 3, end ). The position is given relative to the right edge of the dictionary; this must also be done during decoding.
The buffer function according to the principle of the sliding window (engl .: sliding window ), d. H. the data stream to be compressed is pushed into the buffer from the right. As noted in the algorithm, the shift is made by the length of the match found in the dictionary and another position. This means that redundant triples are avoided, since otherwise new characters are always taken over individually into the dictionary. In the example, the third triple (0, 0, c ) would have to be included, which, however, significantly worsens the compression rate. The matches are marked in green and the character string to be moved is marked in red. Please note that one more character is always shifted than was found in the match so that new characters do not have to be coded twice.
row | 12 | 11 | 10 | 9 | 8th | 7th | 6th | 5 | 4th | 3 | 2 | 1 | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | output | |
1 | empty | a | a | c | a | a | c | a | b | c | a | (0, 0, a ) | ||||||||||||
2 | empty | a | a | c | a | a | c | a | b | c | a | b | (1, 1, c ) | |||||||||||
3 | empty | a | a | c | a | a | c | a | b | c | a | b | a | a | (3, 4, b ) | |||||||||
4th | empty | a | a | c | a | a | c | a | b | c | a | b | a | a | a | c | The End | (3, 3, a ) | ||||||
5 | a | a | c | a | a | c | a | b | c | a | b | a | a | a | c | The End | (12, 3, end ) | |||||||
finished |
The first character seen is unknown, so the first is recorded a
with (0, 0, a ). This can a
already be read from the dictionary in the 2nd line (marked green) so that it is adopted c
as a new character. The 3rd line shows a special case of the LZ77 algorithm, as the matching character string extends into the preview window; this is shown in the example by the green-red cell color. Lines 4 and 5 are to be treated in the same way as the first two, with the exception that an end marker is introduced as the next character in the last triple , since the text has been completely compressed and there is no next character.
Runtime analysis
The runtime of the algorithm is also specified, since the text is only passed through once during compression. This is possible if the preview and dictionary buffer has a constant size and the search for a matching character string is of no importance for large texts. In practical use, implementation with a normal search (without additional data structures) is rather slow.
Advantages and disadvantages
A big advantage of the LZ77 is that it can compress without any knowledge of the text and that it is not patented. The LZ78 (the direct successor of the LZ77) requires the initial dictionary to reconstruct the text (which is usually the trivial dictionary in which each single-character symbol appears once sorted) and was partially protected by patents until 2004. A major disadvantage of the LZ77, however, is that it compresses little or even increases the amount of data, especially with small or non-natural language texts. This becomes clear in the example given, since the original text contains 16 characters and the compression requires 15 characters, which means only 1 character saving. Similar to the Burrows-Wheeler transformation or Move to front coder, it is very suitable as a preprocessor to use other compression methods such as B. Huffman coding to compress data effectively.
decompression
The decompression of LZ77 compressed files is much easier than the compression, as there is no need to search for matching strings. The running time is linear and has exactly as many steps as the original is long, so .
Pseudocode
The pseudocode is a rendering of the LZ77 decompression algorithm with a sliding window:
for Jedes Tripel (offset, length, symbol)
if length > 0; then
Durchlaufe Rückwärts die bisherige Ausgabe und gib solange Zeichen aus
bis eine Länge von length erreicht ist,
bei einem Überlauf beginne erneut bei Offset;
end if
Gib Symbol aus;
end for
example
The decompression of LZ77 encodings does not require an additional dictionary, the output triples are unique in order to reconstruct the text. The cells with a red background are cells that are used for the reconstruction in the next line.
row | input | 16 | 15th | 14th | 13 | 12 | 11 | 10 | 9 | 8th | 7th | 6th | 5 | 4th | 3 | 2 | 1 |
1 | (0, 0, a ) | empty | a | ||||||||||||||
2 | (1, 1, c ) | empty | a | a | c | ||||||||||||
3 | (3, 4, b ) | empty | a | a | c | a | a | c | a | b | |||||||
4th | (3, 3, a ) | empty | a | a | c | a | a | c | a | b | c | a | b | a | |||
5 | (12, 3, end ) | a | a | c | a | a | c | a | b | c | a | b | a | a | a | c | The End |
finished |
If the first entry in a triple is 0, the last entry in the triple is appended to the text (see line 1). In line 2 with the entry (1, 1, c ), the first entry of the reconstructed text from line 1 (first entry of the triple for position 1 and the second entry of the triple for length 1) is used again and then the character c
appended at the back. In line 3, the length of the re-used character string is longer than the stored data in the dictionary (length 4, but only 3 cells are available from memory cell 3), so output is continued from the start position (memory cell 3) and an additional a
issued. This is then added b
as the last entry of the triple to be processed. Lines 4 and 5 are to be understood analogously. The end character is ignored during reconstruction because it is to be understood globally as the end of the text. The fully reconstructed text reads aacaacabcabaaac
.
Efficient construction
For a runtime-efficient calculation of the LZ77 factorization of a character string, the factorization is calculated on the complete character string and the principle of the sliding window is abandoned. Finally, in several applications (see section Use), LZ77 compression is also used on the entire character string.
An efficient calculation of the LZ77 factorization for a character string with regard to the runtime is possible with the help of additional data structures. For a string denote the suffix array and the inverse suffix array of , i.e. H. it is exactly when . In addition, denote the length of the longest common prefix of the strings and , where is.
The calculation makes use of the fact that only the text positions and have to be considered for a factor of with for determining the position of the tuple . Where (NSV stands for next smaller value ) and (PSV stands for previous smaller value ).
Pseudocode
The algorithm LZ_Factor(i, psv, nsv)
returns the starting position of the next factor
Algorithmus LZ_Factor(i, psv, nsv):
if lcp(i, psv) > lcp(i, nsv); then
(pos, len) <- (psv, lcp(i, psv))
else
(pos, len) <- (nsv, lcp(i, nsv))
end if
if len = 0; then
pos <- i
end if
if i + len > n; then
print factor(pos, len, 'Ende')
else
print factor(pos, len, x[i+len])
end if
return i + max(len, 1)
The two arrays and can be calculated from the suffix array in linear time:
for i <- i to n+1; do
j <- i-1
while SA[i] < SA[j]; do
NSV[j] <- i
j <- PSV[j]
end while
PSV[i] <- j
end for
Finally, the algorithm for calculating the LZ77 factorization for the character string follows
Berechne SA, ISA, NSV und PSV für x
k <- 1
while k <= n; do
psv <- SA[PSV[ISA[k]]]
nsv <- SA[NSV[ISA[k]]]
k <- LZ_Factor(k, psv, nsv)
end while
Runtime analysis
The calculation of the LZ77 factorization of a string is possible with the algorithm in time, where is the length of the input string . The LZ_Factor method requires time when the calculation of over range minimum queries is realized. The calculation of the arrays and is possible in time, so that the entire algorithm needs time for the calculation (this assumes a linear time construction of the suffix array).
Comparison of different implementations
There are various implementations for calculating the LZ77 compression of a character string, which differ in terms of runtime and memory requirements. For the comparison, the LZ factorization is considered on the complete character string with the length . The analysis of the memory requirement is based on the storage in 32- or 64-bit words. A letter of the alphabet occupies a quarter of a memory word and an integer a whole memory word.
The following table lists some implementations for calculating the LZ77 factorization with their runtime, their memory requirements and the data structures used.
Algorithm of |
Worst-case runtime |
Storage requirement in words |
Used data structures |
Remarks |
---|---|---|---|---|
Abouelhoda et al. | Suffix array , LCP array | - | ||
Chen et al. | Suffix array, LCP array | - | ||
Chen et al. | Suffix array, LCP array | Position array overwrites suffix array | ||
Chen et al. | Suffix array, LCP array | Position array overwrites suffix array | ||
Crochemore and Ilie |
Suffix array, LCP array, LPF array | LZ77 output does not require any additional space | ||
Crochemore and Ilie |
Suffix array, LCP array, LPF array | LZ77 output does not require any additional space, no access to the character string for LZ77 compression is necessary |
Some of the algorithms listed in the table use the LPF array (LPF stands for Longest Previous Factor ). The LPF array is defined as follows: For each position of the character string is defined as the length of the longest factor of that starts at the position and occurs before the position in , i.e. H. . The LZ77 factorization of a character string can be calculated in linear time from the LPF array. The LPF array can be calculated with the NSV and PSV arrays listed above, because:
For the implementation of some of the algorithms listed in the table, a stack memory is used, which requires additional memory. The table shows the use of a stack by adding one to the memory requirement . The size of the stack memory is limited by - this is the case when the character string has the form with . Is expected .
use
Today the LZ77 compression is u. a. still used on the Game Boy Advance , the AutoCAD DWG format and other embedded systems . Combined with Huffman coding , LZ77 is used in the widely used deflate algorithm, which in turn is used by well-known compression programs and the PNG graphic format . The fact that LZ77 is not covered by any patents is probably the reason that the process is still preferred to the successor LZ78 published a year later , which was patented in parts in some places until 2004.
In the algorithmic processing of character strings, the LZ77 factorization is used to calculate regularities in strings. The factorization is always considered on the entire string and not just within the sliding window. In addition, the output can be simplified because the original string is available in the applications and does not have to be reconstructed. A selection of problems for which the LZ77 factorization can be used can be found in the following list:
- Determination of repetitions in character strings In a character string , non-empty substrings of the form (referred to as tandem repeat or square ) are searched for.
- Strings of repetitions with gaps (ger .: gaps ) of fixed size: In a string non-empty substrings of the form be with and for a given sought.
- Maximum periodicities in character strings: A maximum periodicity is searched for in a character string .
- A periodicity in a character string is a non-empty character string of the form where and is a prefix of . Here is the period length.
- An occurrence of the periodicity with a period length is maximum if
- is or is not a periodicity with period length and
- is or is not a periodicity with period length .
history
The quality of compression is directly dependent on the size of the dictionary. In order to get good compression rates, the sliding window must therefore reach a certain minimum size. Since in the algorithm the text to be compressed must be compared with every entry in the lexicon (see section Algorithm), the time required for compression increases linearly with the size of the window. The LZ77 algorithm in its pure form therefore initially received little attention. In 1982 James Storer and Thomas Szymanski improved some problems in the algorithm now called Lempel-Ziv-Storer-Szymanski (LZSS), which was also used by Phil Katz for the widely used deflate algorithm. The comparison of the text to be compressed with all entries in the lexicon is not necessary with an efficient implementation as in the section on efficient construction, where only a comparison with a maximum of two entries from the lexicon is necessary.
In Reduced Offset Lempel Ziv (ROLZ, also Lempel Ziv Ross Williams , by Ross Williams, 1991) and the Lempel-Ziv-Markow algorithm (LZMA, by Igor Pavlov, 1998) he found better-known, modern successors.
Related algorithms
- Lempel-Ziv-Welch algorithm , dictionary-based compression, often used with graphics data
- LZX algorithm , further development of the LZ77 algorithm
literature
- Mark Nelson, Jean-Loup Gailly: The Data Compression Book . 2nd Edition. M&T Books, New York 1996, ISBN 1-55851-434-1
- Anisa Al-Hafeedh et al .: A Comparison of Index-based Lempel-Ziv LZ77 Factorization Algorithms . In: ACM Computing Surveys , No. 1, Volume 45, 2012, pp. 5: 1–5: 17
- Annette Leßmöllmann : In the crosshairs of the zipper - compression programs can expose the author of a text even though they don't understand a single word . In: Die Zeit , No. 12/2002, p. 45
Web links
Individual evidence
- ^ Jacob Ziv, Abraham Lempel: A Universal Algorithm for Sequential Data Compression . In: IEEE Transactions on Information Theory, No. 3, Volume 23, 1977, pp. 337–343 cs.duke.edu (PDF)
- ↑ ^{a } ^{b } ^{c } ^{d} Anisa Al-Hafeedh et al .: A Comparison of Index-based Lempel-Ziv LZ77 Factorization Algorithms . In: ACM Computing Surveys , No. 1, Volume 45, 2012, pp. 5: 1–5: 17
- ↑ ^{a } ^{b } ^{c} Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi: Linear Time Lempel-Ziv Factorization: Simple, Fast, Small . In: CPM 2013 , Lecture Notes in Computer Science, Volume 7922, Springer, 2013, ISBN 978-3-642-38904-7
- ↑ Mohamed I. Abouelhoda, Stefan Kurtz, Enno Ohlebusch: Replacing suffix trees with enhanced suffix arrays . In: Journal of Discrete Algorithms , No. 1, Volume 2, 2004, pp. 53-86
- ↑ ^{a } ^{b } ^{c} Gang Chen, Simon J. Puglisi, William F. Smyth: Fast and Practical Algorithms for Computing All the Runs in a String . In: Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings , pp. 307-315
- ↑ ^{a } ^{b } ^{c} Gang Chen, Simon J. Puglisi, William F. Smyth: Lempel-Ziv Factorization Using Less Time and Space . In: Mathematics in Computer Science , No. 4, Volume 1, 2008, pp. 605-623
- ↑ ^{a } ^{b} Maxime Crochemore, Lucian Ilie: Computing Longest Previous Factor in linear time and applications . In: Information Processing Letters , No. 2, Volume 106, 2008, pp. 75-80
- ↑ Maxime Crochemore: Transducers and repititions . In: Theoretical Computer Science , No. 1, Volume 45, 1986, pp. 63-86
- ^ Jens Stoye, Dan Gusfield: Simple and flexible detection of contiguous repeats using a suffix tree . In: Theoretical Computer Science , No. 1-2, Volume 270, 2002, pp. 843-856
- ^ Roman M. Kolpakov, Gregory Kucherov: Finding Repeats with Fixed Gap . In: Proceedings of the 7th International Symposium on String Processing and Information Retrieval (SPIRE) , 2000, IEEE Computer Society, pp. 162-168
- ^ Michael G. Main: Detecting leftmost maximal periodicities . In: Discrete Applied Mathematics , No. 1-2, Volume 25, 1989, pp. 145-153