LZ78
LZ78 is a data compression method developed by Jacob Ziv and Abraham Lempel. It is the successor to the LZ77 algorithm, published a year earlier. LZ78 is also called LZ2, since it is the second algorithm in the LZ family.
While the LZ77 algorithm works on already-seen (old) data, LZ78 targets new data. It scans forward through the input buffer and compares it against a dictionary. The algorithm scans through the buffer until no further match with the dictionary is found. At that point it outputs the position of the matched word in the dictionary (if one exists), the length of the match, and the symbol that caused the mismatch. The resulting word is then added to the dictionary.
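The scheme above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not a production implementation: the function name `lz78_encode` and the `(index, symbol)` pair output format are our own choices; the match length is left implicit, since it follows from the referenced dictionary entry.

```python
def lz78_encode(data):
    """Sketch of LZ78 encoding: emits (dictionary index, next symbol) pairs.

    Index 0 means "no prefix yet"; new phrases receive indices 1, 2, 3, ...
    """
    dictionary = {}                       # phrase -> index
    output = []
    phrase = ""
    for symbol in data:
        if phrase + symbol in dictionary:
            phrase += symbol              # still matching: keep scanning forward
        else:
            # Longest match ended: emit its index and the mismatching symbol,
            # then add the extended phrase to the dictionary.
            output.append((dictionary.get(phrase, 0), symbol))
            dictionary[phrase + symbol] = len(dictionary) + 1
            phrase = ""
    if phrase:                            # input ended in the middle of a match
        output.append((dictionary[phrase], ""))
    return output
```

For example, `lz78_encode("ABABA")` yields `[(0, 'A'), (0, 'B'), (1, 'B'), (1, '')]`: two literal symbols, then "AB" as a reference to entry 1 plus 'B', then a final reference to entry 1.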
Although LZ78 was initially very popular, LZ77 came to be used more in the decades after its appearance, because parts of LZ78 were patented in the US until 2003 and in Europe, Canada and Japan until 2004. The best-known form of LZ78 compression is the LZW algorithm, a modification of LZ78 by Terry Welch.
Example
This example illustrates how the most common variant of the LZ78 algorithm, LZW, works. The state of the output and of the dictionary is shown at each step. To keep the example simple, the alphabet is restricted to capital letters; spaces and punctuation marks are omitted. The example compresses a very short message; with real data, the benefit of compression only becomes clearly visible beyond a certain input length.
The message that is compressed in this example is:
TOBEORNOTTOBEORTOBEORNOT#
The pound sign ("#") marks the end of the message. It follows that the dictionary initially has 27 entries (26 capital letters and "#"). To represent the entire dictionary, 5 bits are necessary. As the dictionary grows, more bits are needed to address all of its entries: 5 bits allow 32 possible combinations, so codes 0 through 31 fit in 5 bits. As soon as the 33rd entry is about to be assigned, 6-bit codes are used. The 33rd dictionary entry has index 32, since counting starts at 0.
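The relationship between dictionary size and code width can be computed directly. A small sketch (the helper name `code_width` is our own):

```python
def code_width(dict_size):
    # Smallest number of bits able to index entries 0 .. dict_size - 1.
    return max(1, (dict_size - 1).bit_length())

print(code_width(27))  # 27 initial entries -> 5 bits
print(code_width(32))  # indices 0..31 still fit -> 5 bits
print(code_width(33))  # the 33rd entry has index 32 -> 6 bits
```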
The dictionary will initially look like this:
# = 00000
A = 00001
B = 00010
C = 00011
…
Z = 11010
Encode
Without the LZ78 algorithm, the message would be 125 bits long (25 characters × 5 bits per character). After compression, the original length is compared with the new length.
Again the message to be compressed:
- TOBEORNOTTOBEORTOBEORNOT#
Character: | Bit code: (= output) | New dictionary entry: |
---|---|---|
T | 20 = 10100 | 27: TO |
O | 15 = 01111 | 28: OB |
B | 2 = 00010 | 29: BE |
E | 5 = 00101 | 30: EO |
O | 15 = 01111 | 31: OR ← from here 6 bits are required |
R | 18 = 010010 | 32: RN |
N | 14 = 001110 | 33: NO |
O | 15 = 001111 | 34: OT |
T | 20 = 010100 | 35: TT |
TO | 27 = 011011 | 36: TOB |
BE | 29 = 011101 | 37: BEO |
OR | 31 = 011111 | 38: ORT |
TOB | 36 = 100100 | 39: TOBE |
EO | 30 = 011110 | 40: EOR |
RN | 32 = 100000 | 41: RNO |
OT | 34 = 100010 | 42: OT# |
# | 0 = 000000 |
After compression the message is therefore only 97 bits long (5 codes × 5 bits + 12 codes × 6 bits). That saves 28 bits, a reduction of the original message by about 22%. With a longer text, the dictionary would contain longer entries, so repeated words could be encoded very compactly.
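The entire encoding run can be reproduced with a short sketch of standard LZW (the function name is our own; the initial dictionary is the one from above, with '#' = 0 and 'A' = 1 through 'Z' = 26). Note that standard LZW assigns the first new phrase the next free index after the 27 initial entries, i.e. 27:

```python
def lzw_encode(message, alphabet):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}  # initial one-char entries
    codes = []
    w = ""
    for ch in message:
        if w + ch in dictionary:
            w += ch                               # keep extending the current match
        else:
            codes.append(dictionary[w])           # emit the longest match found
            dictionary[w + ch] = len(dictionary)  # e.g. 27: "TO", 28: "OB", ...
            w = ch
    codes.append(dictionary[w])                   # flush the final match
    return codes

alphabet = "#ABCDEFGHIJKLMNOPQRSTUVWXYZ"          # '#' = 0, 'A' = 1, ..., 'Z' = 26
codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", alphabet)
print(codes)  # 17 codes: 5 emitted with 5 bits, 12 with 6 bits = 97 bits
```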
Decode
In order to reconstruct a compressed message, the initial dictionary must be known. The additional entries can be restored step by step as the compressed message is read.
Bits: | Output: | New entry (whole): | New entry (part): |
---|---|---|---|
10100 = 20 | T | | 27: T? |
01111 = 15 | O | 27: TO | 28: O? |
00010 = 2 | B | 28: OB | 29: B? |
00101 = 5 | E | 29: BE | 30: E? |
01111 = 15 | O | 30: EO | 31: O? ← read 6 bits from here |
010010 = 18 | R | 31: OR | 32: R? |
001110 = 14 | N | 32: RN | 33: N? |
001111 = 15 | O | 33: NO | 34: O? |
010100 = 20 | T | 34: OT | 35: T? |
011011 = 27 | TO | 35: TT (the pending entry is completed with only the first character of this output) | 36: TO? |
011101 = 29 | BE | 36: TOB | 37: BE? |
011111 = 31 | OR | 37: BEO | 38: OR? |
100100 = 36 | TOB | 38: ORT | 39: TOB? |
011110 = 30 | EO | 39: TOBE | 40: EO? |
100000 = 32 | RN | 40: EOR | 41: RN? |
100010 = 34 | OT | 41: RNO | 42: OT? |
000000 = 0 | # | 42: OT# | |
Problems can arise when a newly created dictionary entry is used immediately. In the example above, when the decoder reads the first code and outputs a "T", it knows that the new dictionary entry starts with "T", but not yet what it ends with. The problem is illustrated below. We decode part of a message that reads as follows:
- ABABA
Bits: | Output: | New entry (whole): | New entry (part): |
---|---|---|---|
... | |||
011101 = 29 | AB | 46: (word) | 47: AB? |
101111 = 47 | AB? ← What are we doing here? |
At first glance it seems the decoder cannot possibly handle this. We know entry 47 is supposed to be ABA, but how can the decoder know? The critical step is that entry 47 consists of entry 29 plus whatever character comes next in the input, so entry 47 ends with "whatever comes next". But since code 47 was used immediately after being created, "whatever comes next" is the first character of entry 47 itself, which is the first character of entry 29, namely "A". Entry 47 must therefore end with the same character it begins with. This trick allows the decoder to determine that 47 must be "ABA".
In general terms, this situation occurs whenever the encoder reads input of the form "cScSc", where "c" stands for a single character, "S" for a string of characters, and "cS" is already in the dictionary. The encoder outputs the code for "cS" and adds the new string "cSc" to the dictionary. It then recognizes "cSc" in the input and sends the code that has only just been added to the dictionary.
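The decoding rule, including the "cScSc" special case, can be sketched as follows. This is standard LZW decoding in Python; the function name and variable names are our own:

```python
def lzw_decode(codes, alphabet):
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[codes[0]]
    pieces = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # "cScSc" case: the code refers to the entry being built right now,
            # so it must begin and end with the first character of prev.
            entry = prev + prev[0]
        pieces.append(entry)
        # Complete the pending entry "prev?" with the first character of entry.
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(pieces)

alphabet = "#ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Codes for "ABABABA": the final code (29) is used by the encoder
# immediately after creating it, triggering the special case.
print(lzw_decode([1, 2, 27, 29], alphabet))
```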