There are a number of parameters for Deflate that can be used to influence the execution speed and reduction rate:
- Window size
- The larger the data window is defined in which Deflate searches for existing strings, the more promising it is to find such a string, but the longer the algorithm also needs to execute. The default setting for the window size is usually 32 kibibytes .
- Search intensity
- The algorithm can search the window mentioned above in more or less detail. If, for example, fast execution is more important, a very good data reduction can be dispensed with in favor of speed. In the gzip program , this property can be specified using the parameters (- #) 1 to 9: 1 is the fastest, 9 is the most detailed.
- Huffman tables
- These can be determined for the available data or tables can be specified. The latter saves a bit of time, but may result in less good data reduction.
Compression is achieved in two steps:
- Find and replace duplicate strings with references.
- Replacement of symbols (characters) by partly shorter symbols weighted according to the frequency of occurrence.
A deflate data stream (stream) consists of a sequence of blocks. Each block is preceded by a 3- bit header:
- 1 bit: marker for the last block in the data stream:
1: This is the last block in the stream.
0: There are more blocks to follow.
- 2 bits: coding method for this block:
00: a literal section that is between 0 and 65,535 bytes long.
01: a compressed block using a predefined static Huffman tree.
10: compressed block, with its own Huffman tree.
11: Reserved, not currently used.
Most of the blocks are encoded using
10the dynamic Huffman method. This creates an individually optimized Huffman tree for each block. Instructions to build the necessary Huffman tree follow immediately after the block header.
Elimination of duplicate strings
If a sequence of repeated bytes (corresponds to a repeated character string) is found within a block to be compressed, it is replaced with a backward reference . This points to a previous occurrence of the character string. A hit on a previous character string consists of a length (3 to 258 bytes) and a distance (1 to 32,768 bytes). This backward reference can extend over any number of blocks, but must be within the distance of 32 KiB within the already decompressed data, i.e. within the sliding window .
The second phase of compression consists of replacing frequently used symbols (characters) with shorter forms of representation and rarer symbols (characters) with forms of representation that require slightly more space. This method of Huffman coding creates a prefix-free tree with non-overlapping spaces, in which the length of each bit sequence is inversely proportional to the probability of the occurrence of the respective symbol: the more frequently a symbol occurs, the shorter its bit sequence can be represented for coding.
A tree is created that has space for 288 symbols:
- 0 to 255: represents a literal / symbol 0 to 255.
- 256: End of the block - stops processing if it is the last block, otherwise processing continues with the next block.
- 257 to 285: combined with extra bits, a hit with 3 to 258 bytes.
- 286 to 287: not used, reserved and invalid, but still part of the tree.
The hit length always follows the distance. Depending on the code, additional extra bits may be read to determine the final distance. The distance tree consists of 32 symbols:
- 0 to 3: Distance 1 to 4
- 4 to 5: Distance 5 to 8, 1 extra bit
- 6 to 7: Distance 9 to 16, 2 extra bits
- 8 to 9: Distance 17 to 32, 3 extra bits
- 26 to 27: Distance 8,193 to 16,384, 12 extra bits
- 28 to 29: Distance 16,385 to 32,768, 13 extra bits
- 30 to 31: not used, reserved and invalid, but still part of the tree.
Note that for the symbols 2 to 29, the number of extra bits as can be calculated as follows: .
Deflate64 / Enhanced Deflate
Deflate64 is a proprietary variant of the Deflate algorithm. The underlying mechanism remains unchanged. The only changes are:
- Extension of the dictionary from 32 KiB to 64 KiB.
- Extension of the distance tree: The distance symbols 30 and 31 in the distance tree reserved for future use in Deflate are now each assigned 14 extra bits in order to be able to address distances from 32 KiB to 64 KiB.
- Extension of the symbol tree: The symbol 285 is extended by 16 extra bits. The original limitation of 258 bytes is no longer applicable and sequence lengths in the range from 3 to 65,538 bytes can be addressed.
Compared to Deflate, Deflate64 improves the compression rate slightly. At the same time, however, the compression takes a little longer.
Zlib does not support Deflate64. The reason is the insufficient overall improvement and the decision to treat Deflate64 as a proprietary format.
Deflate is used in the following common formats, among others:
- in archive format ZIP ,
- gzip files,
- the image file format PNG ,
- the image file format TIFF ,
- in OpenDocument , the ISO standard format for Office files,
- the Portable Document Format (PDF),
- the Web Open Font Format and
- the archive format CAB .
It is the most common method for compressed HTTP transfers, see the article Hypertext Transfer Protocol , section HTTP compression .
Following the example of LHA, Katz implemented the Lempel-Ziv-Storer-Szymanski algorithm published in 1982 , which was an improvement on the old Lempel-Ziv algorithm from 1977 . Deflate in 1993 with version 2 of the software PKZIP by Phil Katz 'company PKWare Inc. introduced. The exact specification of Deflate and the associated bit stream format is recorded in RFC 1951 from May 1996.
- RFC 1951 - DEFLATE Compressed Data Format Specification version 1.3 (English)
- Detailed explanation of the algorithm (English)
- Example implementation in C ++ and illustration of a data block
- Binary Essence - Deflate64 ( Memento from February 7, 2015 in the Internet Archive )
- Archive link ( Memento from February 21, 2008 in the Internet Archive )