Progressive compression

Progressive compression, also called compact compression or solid compression, is a method and preprocessing step for compressing multiple files. The files are combined into one or more large blocks and then compressed across file boundaries. This usually achieves a higher compression ratio than compressing each file individually; how large the gain is depends on how similar the files are.

Function

Solid compression can be used whenever several files are combined into one archive. With solid compression, all files are concatenated before the actual compression and then compressed as a single continuous data stream. Without solid compression, by contrast, the individual files are first compressed independently of one another and only afterwards combined into an archive file. Solid compression usually improves the compression ratio, especially for many small, similar files (such as log files). The reason is that solid compression can also exploit redundancies between different files for data reduction, whereas without it only redundancies within each individual file can be used.
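
This effect is easy to demonstrate; the following is a minimal sketch in Python using the standard zlib module, with made-up, log-like example data (file contents and counts are purely illustrative):

    import zlib

    # Made-up example data: many small, similar "log" files.
    files = [f"2024-01-{day:02d} INFO service started\n" * 20 for day in range(1, 11)]
    data = [f.encode() for f in files]

    # Without solid compression: each file is compressed on its own.
    individual = sum(len(zlib.compress(d)) for d in data)

    # With solid compression: all files are concatenated first and then
    # compressed as a single continuous data stream.
    solid = len(zlib.compress(b"".join(data)))

    print(individual, solid)  # the solid size is typically much smaller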

Some archiving programs (e.g. RAR) sort the files by file type beforehand in order to improve the compression ratio even further.

Archiving without (top) and with (bottom) solid compression (schematic representation)

Technical explanation

Modern archivers use a combination of dictionary compression (e.g. LZ77) and entropy coding (e.g. Huffman coding), as is the case, among others, with the Deflate algorithm.

The aim of dictionary compression is to replace byte sequences that occur multiple times so that they only have to be stored once. The dictionary (also called a lexicon) used for this is nowadays generally built up incrementally during compression and decompression, starting from an initially empty dictionary, so that it does not have to be transmitted or stored separately. Only literals, i.e. byte sequences that are not yet in the dictionary, or references to existing dictionary entries are transmitted. All transmitted literals are immediately added to the dictionary, by both the encoder and the decoder. This leads to a "warm-up phase" at the beginning of the compression process, during which the dictionary must first be filled before dictionary references can actually save any space (cf. LZ77 and LZ78).
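
To illustrate, the following is a deliberately simplified, greedy LZ77-style encoder in Python (a sketch of the principle, not the actual Deflate implementation). The sliding window of already-seen data acts as the implicit dictionary; because it starts empty, the first occurrence of every byte sequence must be emitted as literals, which is exactly the warm-up phase described above:

    def lz77_encode(data: bytes, window: int = 4096, min_match: int = 3):
        """Greedy LZ77 sketch: emit literals, or (offset, length) references
        into the already-seen sliding window (the implicit dictionary)."""
        out, i = [], 0
        while i < len(data):
            best_len, best_off = 0, 0
            # Search the window of previously seen bytes for the longest match.
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 255):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_match:
                out.append(("ref", best_off, best_len))  # dictionary reference
                i += best_len
            else:
                out.append(("lit", data[i]))  # literal; it extends the window
                i += 1
        return out

    # The first "abc" must be sent as literals (warm-up); the rest is one reference.
    print(lz77_encode(b"abcabcabc"))
    # [('lit', 97), ('lit', 98), ('lit', 99), ('ref', 3, 6)]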

Without solid compression, only redundancies within each individual file can be removed. Redundancies between files are not exploited, because every file starts with an empty dictionary, and compression also suffers slightly from the renewed warm-up phase for every file. With solid compression, on the other hand, the same dictionary can be used for all files. This is particularly useful for files with similar content, and the advantage is most pronounced for small files, where the warm-up phase accounts for a larger share of the data.
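
Python's zlib module exposes this effect directly through its optional zdict parameter, which lets a compressor start with a preset dictionary instead of an empty one, much like a solid stream carries its dictionary across file boundaries. A sketch with made-up, similar example data:

    import zlib

    first = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n" * 4
    second = b"GET /style.css HTTP/1.1\r\nHost: example.org\r\n\r\n" * 4

    # Cold start: the second file begins with an empty dictionary.
    cold = len(zlib.compress(second))

    # Warm start: the first file's content serves as a preset dictionary,
    # as if both files were part of one solid stream.
    comp = zlib.compressobj(zdict=first)
    warm = len(comp.compress(second) + comp.flush())

    print(cold, warm)  # warm is smaller: no renewed warm-up phase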

The above considerations also apply in a similar form to the adaptive context models used in entropy coding (see PPMd or LZMA).

Disadvantages

Since a solid archive consists of a single, continuously compressed data stream, random access to individual files is not possible. To unpack a particular file, all data in the archive that precedes it must first be decompressed. This usually takes place only in main memory, since the preceding data is needed merely to reach the desired file and can be discarded afterwards. If the archive is damaged, an error can extend beyond the affected file; how far depends on the compression ratio, and in the worst case all data from the error position onward is lost.
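
The following sketch illustrates this with zlib standing in for the solid stream (file contents and sizes are made up): to extract the last file, the decompressor must process everything before it, even though those bytes are discarded immediately:

    import zlib

    files = [b"a" * 100, b"b" * 100, b"c" * 100]  # made-up file contents
    sizes = [len(f) for f in files]
    solid = zlib.compress(b"".join(files))  # one continuous stream

    def extract(blob, sizes, index):
        """Extract file `index` from a solid stream: everything before it
        must be decompressed as well, but can be discarded immediately."""
        skip = sum(sizes[:index])
        d = zlib.decompressobj()
        data = d.decompress(blob, skip + sizes[index])  # prefix + target file
        return data[skip:]

    assert extract(solid, sizes, 2) == files[2]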

Additional files can only be appended to the end of the archive.

The only way to delete files from the archive file is to completely decompress the data stream, remove the files to be deleted, and then solidly compress the remaining files again.

As a compromise, so that the complete archive does not have to be unpacked every time, the length of contiguously compressed data can be limited, creating independently compressed blocks.
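
A sketch of this compromise in Python (the block size and helper function are illustrative and do not correspond to any particular archiver's format):

    import zlib

    def solid_blocks(files, block_size=1 << 20):
        """Group files into blocks of roughly `block_size` uncompressed bytes;
        each block is solidly compressed on its own, so extracting a file
        only requires decompressing the block that contains it."""
        blocks, current, used = [], [], 0
        for data in files:
            current.append(data)
            used += len(data)
            if used >= block_size:
                blocks.append(zlib.compress(b"".join(current)))
                current, used = [], 0
        if current:
            blocks.append(zlib.compress(b"".join(current)))
        return blocks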

Use

Progressive compression is supported by the archive formats 7z, RAR, ACE and ARC, among others.

In Unix environments, separate tools for archiving and compression are traditionally used (see Unix philosophy). Usually all files are first combined into an (uncompressed) archive with the tar tool, which can then be compressed, for example with gzip (resulting in .tar.gz), bzip2 (.tar.bz2) or xz (.tar.xz). This procedure corresponds to progressive compression.
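
The same two-step procedure can be reproduced with Python's standard library (file names and contents are placeholders):

    import gzip, shutil, tarfile
    from pathlib import Path

    # Placeholder input files.
    Path("log1.txt").write_text("2024-01-01 INFO started\n" * 50)
    Path("log2.txt").write_text("2024-01-02 INFO started\n" * 50)

    # Step 1: combine everything into one uncompressed tar archive.
    with tarfile.open("logs.tar", "w") as tar:
        tar.add("log1.txt")
        tar.add("log2.txt")

    # Step 2: compress the archive as a single continuous stream.
    with open("logs.tar", "rb") as src, gzip.open("logs.tar.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # tarfile can also perform both steps at once via mode "w:gz"
    # ("w:bz2" and "w:xz" work analogously).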

The popular ZIP file format, however, does not support solid archiving. Progressive compression can nevertheless be approximated with two nested ZIP archives: first, all individual files are combined into a ZIP archive without compression; this ZIP file is then compressed with the desired compression level.
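
A sketch of this trick with Python's zipfile module (file names and contents are made up):

    import io
    import zipfile

    # Inner archive: ZIP_STORED combines the files without compression.
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.log", "2024-01-01 INFO started\n" * 50)
        zf.writestr("b.log", "2024-01-02 INFO started\n" * 50)

    # Outer archive: the inner ZIP is compressed as one continuous blob,
    # so redundancies between a.log and b.log can be exploited.
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("combined.zip", inner.getvalue())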
