Deflate

Deflate ( English for letting the air out ) is an algorithm for lossless data compression . It was developed by Phil Katz for the ZIP developed -Archivformat and later the public domain supplied.

description

Deflate is a combination of the Lempel-Ziv-Storer-Szymanski algorithm and the Huffman coding .

LZSS replaces character strings that occur several times. This is followed by an entropy coding according to Huffman .

There are a number of parameters for Deflate that can be used to influence the execution speed and reduction rate:

Window size: The larger the data window is defined in which Deflate searches for existing strings, the more promising it is to find such a string, but the longer the algorithm also needs to execute. The default setting for the window size is usually 32 kibibytes .
Search intensity: The algorithm can search the window mentioned above in more or less detail. If, for example, fast execution is more important, a very good data reduction can be dispensed with in favor of speed. In the gzip program , this property can be specified using the parameters (- #) 1 to 9: 1 is the fastest, 9 is the most detailed.
Huffman tables: These can be determined for the available data or tables can be specified. The latter saves a bit of time, but may result in less good data reduction.

Compression is achieved in two steps:

Find and replace duplicate strings with references.
Replacement of symbols (characters) by partly shorter symbols weighted according to the frequency of occurrence.

Bitstream format

A deflate data stream (stream) consists of a sequence of blocks. Each block is preceded by a 3- bit header:

1 bit: marker for the last block in the data stream:
- 1: This is the last block in the stream.
- 0: There are more blocks to follow.
2 bits: coding method for this block:
- 00: a literal section that is between 0 and 65,535 bytes long.
- 01: a compressed block using a predefined static Huffman tree.
- 10: compressed block, with its own Huffman tree.
- 11: Reserved, not currently used.

Most of the blocks are encoded using 10the dynamic Huffman method. This creates an individually optimized Huffman tree for each block. Instructions to build the necessary Huffman tree follow immediately after the block header.

Elimination of duplicate strings

If a sequence of repeated bytes (corresponds to a repeated character string) is found within a block to be compressed, it is replaced with a backward reference . This points to a previous occurrence of the character string. A hit on a previous character string consists of a length (3 to 258 bytes) and a distance (1 to 32,768 bytes). This backward reference can extend over any number of blocks, but must be within the distance of 32 KiB within the already decompressed data, i.e. within the sliding window .

Bit reduction

The second phase of compression consists of replacing frequently used symbols (characters) with shorter forms of representation and rarer symbols (characters) with forms of representation that require slightly more space. This method of Huffman coding creates a prefix-free tree with non-overlapping spaces, in which the length of each bit sequence is inversely proportional to the probability of the occurrence of the respective symbol: the more frequently a symbol occurs, the shorter its bit sequence can be represented for coding.

A tree is created that has space for 288 symbols:

0 to 255: represents a literal / symbol 0 to 255.
256: End of the block - stops processing if it is the last block, otherwise processing continues with the next block.
257 to 285: combined with extra bits, a hit with 3 to 258 bytes.
286 to 287: not used, reserved and invalid, but still part of the tree.

The hit length always follows the distance. Depending on the code, additional extra bits may be read to determine the final distance. The distance tree consists of 32 symbols:

0 to 3: Distance 1 to 4
4 to 5: Distance 5 to 8, 1 extra bit
6 to 7: Distance 9 to 16, 2 extra bits
8 to 9: Distance 17 to 32, 3 extra bits
...
26 to 27: Distance 8,193 to 16,384, 12 extra bits
28 to 29: Distance 16,385 to 32,768, 13 extra bits
30 to 31: not used, reserved and invalid, but still part of the tree.

Note that for the symbols 2 to 29, the number of extra bits as can be calculated as follows: . ${\ displaystyle \ left \ lfloor {\ frac {n} {2}} - 1 \ right \ rfloor}$

Deflate64 / Enhanced Deflate

Deflate64 is a proprietary variant of the Deflate algorithm. The underlying mechanism remains unchanged. The only changes are:

Extension of the dictionary from 32 KiB to 64 KiB.
Extension of the distance tree: The distance symbols 30 and 31 in the distance tree reserved for future use in Deflate are now each assigned 14 extra bits in order to be able to address distances from 32 KiB to 64 KiB.
Extension of the symbol tree: The symbol 285 is extended by 16 extra bits. The original limitation of 258 bytes is no longer applicable and sequence lengths in the range from 3 to 65,538 bytes can be addressed.

Compared to Deflate, Deflate64 improves the compression rate slightly. At the same time, however, the compression takes a little longer.

In addition to PKZIP and a large number of commercial applications, many open source projects such as 7-Zip or Info-ZIP Deflate64 also support it.

Zlib does not support Deflate64. The reason is the insufficient overall improvement and the decision to treat Deflate64 as a proprietary format.

use

Deflate is used in the following common formats, among others:

in archive format ZIP ,
gzip files,
the image file format PNG ,
the image file format TIFF ,
in OpenDocument , the ISO standard format for Office files,

the Portable Document Format (PDF),

the Web Open Font Format and

the archive format CAB .

It is the most common method for compressed HTTP transfers, see the article Hypertext Transfer Protocol , section HTTP compression .

Implementations

The free program library zlib abstracts the use of Deflate for use in other programs. Over 500 programs use the algorithm in this way. The historical first implementation was in Phil Katz ' PKZIP . There are many other implementations in a number of different programming languages, namely in Java, Ada, Pascal, JavaScript and Seed7, or with other specializations. 7-Zip contains its own implementation, which is known for introducing a stronger, more computationally intensive level of compression. KZIP by Ken Silverman is a specialized in-house implementation that aims for the smallest possible file sizes and for some time was considered the best available tool for this niche. The free reference implementation of the Zopfli algorithm usually compresses even more compactly.

history

Following the example of LHA, Katz implemented the Lempel-Ziv-Storer-Szymanski algorithm published in 1982 , which was an improvement on the old Lempel-Ziv algorithm from 1977 . Deflate in 1993 with version 2 of the software PKZIP by Phil Katz 'company PKWare Inc. introduced. The exact specification of Deflate and the associated bit stream format is recorded in RFC 1951 from May 1996.

Web links

RFC 1951 - DEFLATE Compressed Data Format Specification version 1.3 (English)
Detailed explanation of the algorithm (English)
Example implementation in C ++ and illustration of a data block

swell

↑ Binary Essence - Deflate64 ( Memento from February 7, 2015 in the Internet Archive )
↑ http://zlib.net/apps.html
↑ http://www.jcraft.com/jzlib/
↑ http://unzip-ada.sourceforge.net/
↑ Archive link ( Memento from February 21, 2008 in the Internet Archive )
↑ https://github.com/nodeca/pako
↑ http://gildas-lormeau.github.com/zip.js/
↑ http://seed7.sourceforge.net/libraries/deflate.htm

[1] Binary Essence - Deflate64 ( Memento from February 7, 2015 in the Internet Archive )

[2] ttp://zlib.net/apps.html

[3] ttp://www.jcraft.com/jzlib/

[4] ttp://unzip-ada.sourceforge.net/

[5] Archive link ( Memento from February 21, 2008 in the Internet Archive )

[6] ttps://github.com/nodeca/pako

[7] ttp://gildas-lormeau.github.com/zip.js/

[8] ttp://seed7.sourceforge.net/libraries/deflate.htm