LZ78

from Wikipedia, the free encyclopedia

LZ78 is a data compression method developed by Jacob Ziv and Abraham Lempel . LZ78 is the successor to the LZ77 algorithm published a year earlier . LZ78 is also called LZ2 as the following LZ procedure .

While the LZ77 algorithm works with old data, LZ78 aims at new data. It does this by forward scanning the input buffer, which is compared to a dictionary. The algorithm scans through the buffer until no more hits are obtained with the dictionary. If that happens, the position of the word in the dictionary - if one is available - is output. The length of the hit and the symbol that caused the error are also output. The resulting word is then added to the dictionary.

Although LZ78 initially achieved great popularity, a few decades after its appearance, LZ77 was used more as parts of LZ78 were patented in the US until 2003 and in Europe, Canada and Japan until 2004. The best known form of LZ78 compression is the LZW algorithm, a modification of the LZ78 algorithm by Terry Welch .

example

This example illustrates how the most common variant of the LZ78 algorithm, the LZW, works. The status of the output and the dictionary are given in each step. To keep the example simple, a simple alphabet made up of only capital letters is used. Spaces and punctuation marks have been omitted. This example shows the compression of a very short message. With real data, the compression only becomes clearly visible from a certain length.

The message that is compressed in this example is:

 TOBEORNOTTOBEORTOBEORNOT#

The pound sign ("#") marks the end of the message. It follows that the dictionary will initially have 27 entries (26 capital letters and "#"). In order to be able to represent the entire dictionary, 5 bits are necessary. As the dictionary grows, more bits are needed to access all of the elements in the dictionary. 5 bits allow 32 possible combinations of bits. Therefore 6-bit long chains are used from the 33rd entry. The 33rd dictionary entry is marked as 32nd since the count starts at 0.

The dictionary will initially look like this:

  # = 00000
  A = 00001
  B = 00010
  C = 00011
  …
  …
  …
  Z = 11010

Encode

Without the LZ78 algorithm, the message would be 125 bits long (25 characters × 5 bits per character). After compression, the original length is compared with the new length.

Again the message to be compressed:

  • TOBEORNOTTOBEORTOBEORNOT #
Character: Bit code: (= output) New dictionary entry:
T 20 = 10100 28: TO
O 15 = 01111 29: OB
B. 2 = 00010 30: BE
E. 5 = 00101 31: EO
O 15 = 01111 32: OR ← from here 6 bits are required
R. 18 = 010010 33: RN
N 14 = 001110 34: NO
O 15 = 001111 35: OT
T 20 = 010100 36: TT
TO 28 = 011100 37: TOB
BE 30 = 011110 38: BEO
OR 32 = 100,000 39: LOCATION
TOB 37 = 100101 40: TOBE
EO 31 = 011111 41: EOR
RN 33 = 100001 42: RNO
OT 35 = 100011 43: OT #
# 0 = 000000

After compression, the length is therefore only 97 bits (5 × 5 + 12 × 6). The saving is therefore 28 bits, which means a reduction in size of the original message by 22%. If the text were longer, the dictionary would have longer entries, so the repeating words could be sent very compactly.

Decode

In order to be able to reconstruct a compressed message again, the initial dictionary must be known. The additional entries can be gradually restored as the compressed message is read.

Bits: Output: New entry (whole): New entry (part):
10100 = 20 T 28: T?
01111 = 15 O 28: TO 29: O?
00010 = 2 B. 29: OB 30: B?
00101 = 5 E. 30: BE 31: E?
01111 = 15 O 31: EO 32: O? ← read 6 bits from here
010010 = 18 R. 32: OR 33: R?
001110 = 14 N 33: RN 34: N?
001111 = 15 O 34: NO 35: O?
010100 = 20 T 35: OT 36: T?
011100 = 28 TO 36: TT (only add the first element of the next dictionary entry) 37: TO?
011110 = 30 BE 37: TOB 38: BE?
100000 = 32 OR 38: BEO 39: OR?
100101 = 37 TOB 39: LOCATION 40: TOB?
011111 = 31 EO 40: TOBE 41: EO?
100001 = 33 RN 41: EOR 42: RN?
100011 = 35 OT 42: RNO 43: OT?
000000 = 0 #

Problems can arise if the newly created dictionary entry is used immediately. In the example above, the decoder reaches the first character, a "T", it knows that the character 28 starts with a "T", but what does it end with? The problem is presented below. We decode the part of a message that reads as follows:

  • ABABA
Bits: Output: New entry (whole): New entry (part):
...
011101 = 29 FROM 46: (word) 47: AB?
101111 = 47 FROM? ← What are we doing here?

At first glance it seems like the decoder could not possibly decode this. We know entry 47 is supposed to be ABA, but how can the decoder know? The critical step is that 47 consists of 29 plus what comes after. 47 therefore ends with “whatever comes after”. But since it was sent immediately, it must also begin with “whatever comes after” and must therefore end with the same character that it begins with, namely A. This trick allows the decoder to determine that 47 must be “ABA” .

In general terms, this situation occurs whenever the encoder reads input in the form “cScSc”, where “c” stands for a single character and “S” for a string of characters and “cS” is already in the dictionary. The encoder outputs the character for "cS" and adds the new character "cSc" to the dictionary. Then it recognizes “cSc” in the input and sends a character that has just been added to the dictionary.