Data compression


Data compression (German: Datenkompression, probably a loan translation of the English term; also called Datenkomprimierung) is a process in which the volume of digital data is compressed, that is, reduced. This reduces both the storage space required and the time needed to transmit the data. In communications technology, the compression of messages from a source by a sender is known as source coding.

In principle, data compression attempts to remove redundant information. To this end, the data is converted into a representation in which all of the information, or at least most of it, can be presented in a shorter form. This step is carried out by an encoder and is known as compression; the reverse step is known as decompression.

One speaks of lossless compression, lossless coding or redundancy reduction when exactly the original data can be recovered from the compressed data. This is necessary, for example, when compressing executable program files.

With lossy compression or irrelevance reduction, the original data can usually no longer be recovered exactly from the compressed data; part of the information is lost. The algorithms try to omit only information that is as "unimportant" as possible. Such methods are often used for image, video and audio compression.

General

Data compression occurs in most long-distance transmissions of digital data today. It helps to save resources when transmitting or storing data by transforming the data into a form that is, depending on the application, as compact as possible. Only data that contains some form of redundancy can be compressed without loss. If there is no redundancy, as in the case of completely random data, lossless compression is impossible in principle because of the Kolmogorov complexity. Likewise, the pigeonhole principle rules out the possibility that every file can be compressed without loss. Lossy compression, on the other hand, is always possible: an algorithm orders the data according to importance and then discards the "unimportant" parts. By shifting the "keep threshold" in this ranking, more and more can be discarded.

With data compression, computational effort is required on both the sender and the receiver side in order to compress the data or to restore it. The computational effort differs greatly between compression methods. Deflate and LZO, for example, are very fast in both compression and decompression, while LZMA achieves particularly high compression, and thus the smallest possible amount of data, at great computational expense, although the compressed data can nevertheless be converted back into its original form very quickly. Depending on the field of application, this forces a different choice of compression method. Compression methods are therefore optimized for data throughput, energy requirements or data reduction, so the aim of compression is not always the most compact result possible. The difference becomes clear in these examples:

  • If video or sound recordings are broadcast live, compression and recovery must be performed as quickly as possible. Quality losses are acceptable if this keeps the transmission within the maximum (available) transmission rate. This applies, for example, to telephone calls, where the conversation partner can often still be understood even if the sound quality is poor.
  • If a single file is downloaded by countless users, a slow but very powerful compression algorithm is worthwhile. The reduced bandwidth during transmission easily makes up for the time required for compression.
  • When backing up and archiving data, an algorithm must be used that can still be used in the distant future. In this case, only well-established, tried and tested algorithms are suitable, and these sometimes do not have the best compression rates.
  • The type of data is also relevant for the selection of the compression method. For example, the two compression programs gzip and bzip2, which are common on Unix-like operating systems, differ in that gzip only considers blocks of 32,000 bytes, while bzip2 uses a block size of 900,000 bytes; redundant data is only compressed within these blocks. A small comparison of the two underlying algorithm families is sketched below.
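
As a rough illustration of such differences, the following Python sketch compresses the same highly redundant input with the zlib module (Deflate, as used by gzip) and the bz2 module (the Burrows-Wheeler based method of bzip2). Which method wins depends on the data, the block or window size and the chosen compression level; the input string is only an example.

# Compare two compression families on the same redundant input:
# zlib implements Deflate (as in gzip), bz2 the bzip2 algorithm.
import zlib, bz2

data = b"AUCH EIN KLEINER BEITRAG IST EIN BEITRAG " * 10000

deflate_out = zlib.compress(data, 9)   # Deflate, 32 KiB sliding window
bzip2_out = bz2.compress(data, 9)      # Burrows-Wheeler based, large blocks

print(len(data), len(deflate_out), len(bzip2_out))
# Highly redundant input shrinks dramatically with both methods;
# the exact ratio depends on the data and the block/window size.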

Sometimes the data is transformed into another representation before compression, which enables some methods to compress the data more efficiently afterwards. This preprocessing step is called precoding. Examples are the Burrows-Wheeler transformation and move-to-front in bzip2.
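
The two precoding steps can be sketched naively in Python as follows. The quadratic rotation-sorting construction and the end marker "$" are simplifications for illustration only; bzip2 uses far more efficient suffix sorting.

# Naive sketch of the bzip2-style precoding steps:
# Burrows-Wheeler transform (BWT) followed by move-to-front (MTF).
def bwt(s, eos="$"):
    s += eos                                   # unique end marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s):
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))  # move symbol to the front
    return out

text = "ananas banana"
transformed = bwt(text)
print(transformed)                  # equal letters tend to cluster
print(move_to_front(transformed))   # clusters become runs of small numbers

After these steps the output contains long runs of identical symbols and many small numbers, which a subsequent run-length or entropy coder can compress well.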

The field of data compression partly overlaps with information theory and artificial intelligence, and in the area of lossy data compression also with perceptual psychology (see below). Information theory is involved because the file size of a data set that has been compressed as much as possible directly indicates the information content of that data set.

If a compression algorithm can learn under which circumstances the character string "ABC" is followed by a "D", the "D" does not have to be stored in the compressed file at all; when the original file is restored, the algorithm knows where a "D" is to be inserted. Although no such compression algorithm has yet come into practical use, various compression methods using artificial neural networks and machine learning are under development.

Limits of compressibility

Lossy compression

As described above, lossy compression is always possible: the threshold for what counts as "superfluous" can be raised until only one bit remains. The boundaries are fluid and depend on the use case. For example, the German sentence "Das Haus ist groß" ("The house is large") could be compressed to "Das Haus ist gr"; if the reader wants to know which properties the house has, it is no longer possible to tell whether it is "grau" (gray), "grün" (green) or "groß" (large). If the reader only wants to know whether something was said about a house at all, this can still clearly be answered in the affirmative.

In the case of lossy image compression, details are increasingly lost or become blurred until finally everything "blurs" into a surface of uniform color; an audio recording usually becomes dull and indistinct, and after the greatest possible compression most algorithms would leave little more than a simple sine tone.

Lossless compression

With lossless compression, much narrower limits apply, since it must be guaranteed that the compressed file can be transformed back into the original file.

The Kolmogorov complexity deals with the smallest possible "instruction" that is necessary to restore the original data from the compressed data. For example, the number "100000000000000000000000000000000000" can be compressed very easily: "Write 1 and then 35 zeros", a compression from 36 to 25 characters. Any number of digits after the decimal point of the number pi can likewise be compressed by its calculation rule, in which case the compression algorithm would have to recognize that the data represents pi. It should be noted that for compressed files the recovery algorithm would also have to be added to the file size, since any compressed file is worthless without such an algorithm. The above number could also be compressed as "10^35" or "1e35", in which case the reader must be aware of the method of restoration, namely the power notation. If, however, a character string has no recognizable structure or peculiarities, compression is impossible: the instruction would have to contain the unchanged original data.
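
As a toy illustration in Python (the character counts refer to the expressions used here, not to any formal measure, and a fair comparison would also have to count the interpreter that executes the rule):

# A "description" that is shorter than the data it reproduces.
number = "1" + "0" * 35            # the 36-digit number itself
rule = '"1" + "0" * 35'            # a short rule that reproduces it
assert eval(rule) == number        # the rule really regenerates the number
print(len(number), len(rule))      # the rule is much shorter than 36 digits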

Another reason why some data cannot be compressed is the so-called pigeonhole principle: if there are fewer pigeonholes than pigeons, two or more pigeons inevitably have to share a pigeonhole. An n-bit storage space can hold one of 2^n possible pieces of information, and consequently a storage space that is one bit smaller can only hold one of half as many possibilities: 16 bits allow 2^16 = 65,536 possibilities, 15 bits only 2^15 = 32,768. If every possible file could be reduced by one bit, this would mean, by the pigeonhole principle, that each smaller storage space would have to hold two different compressed files at the same time. Since lossless compression requires a reversible, unambiguous assignment between compressed and uncompressed files, this is impossible.

If the pigeonhole principle did not hold and there were an algorithm that could compress any file by at least one bit, it could be applied recursively to its own output, and any information could be reduced to 0 bits. In practice, data that has already been compressed can only be compressed again if the previous run used an algorithm that is not 100% efficient and has not yet completely removed the redundancy (for example, a very large file full of zeros can be compressed twice with gzip).
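
This behavior can be demonstrated with Python's zlib module (which implements Deflate, the algorithm behind gzip): extremely redundant data shrinks dramatically, a second pass may still gain something because Deflate is not perfectly efficient, while (pseudo)random data does not shrink at all. The input sizes below are arbitrary.

# Redundant data compresses; random data does not.
import os, zlib

zeros = bytes(1_000_000)            # extremely redundant input (all zero bytes)
once = zlib.compress(zeros, 9)      # shrinks dramatically
twice = zlib.compress(once, 9)      # may shrink further: pass 1 left redundancy
noise = os.urandom(1_000_000)       # no exploitable redundancy

print(len(zeros), len(once), len(twice))
print(len(noise), len(zlib.compress(noise, 9)))   # output is not smaller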

The conclusion from these two facts is that purely random data is (most likely) incompressible, since it contains no structure, and that much, but not all, data can be compressed. Two prizes, $100 for successfully compressing a million random digits and $5,000 for successfully compressing a file of any length generated by the prize sponsor Mike Goldman, have not been paid out to date.

Lossless compression

With lossless compression, the original data can be exactly restored from the compressed data. No information is lost in the process. Essentially, lossless compression processes use the redundancy of data, which is also referred to as redundancy reduction .

The theoretical basis is the information theory (related to the algorithmic information theory ). Due to the information content, it specifies a minimum number of bits that are required to code a symbol. Lossless compression methods try to encode messages in such a way that they approximate their entropy as closely as possible.

Text

Texts, provided they consist of letters or are stored as character strings, and thus not as an image (raster graphics, typically a scanned page of a book), take up comparatively little storage space. Lossless compression methods can reduce this to between 20% and 10% of the space originally required.

Examples:

Source text: AUCH EIN KLEINER BEITRAG IST EIN BEITRAG
Coded text:  AUCH EIN KLEINER BEITRAG IST /2 /4

Here it was recognized that the words EIN and BEITRAG each appear twice, and the references /2 and /4 indicate that they match words that occurred earlier (the second and fourth words of the text). On closer inspection, the EIN contained in KLEINER could also be coded in the same way.
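
A minimal word-level sketch of this back-referencing idea in Python reproduces the coded text above; real LZ-family compressors work on bytes rather than words and store explicit offset/length pairs instead of the /n notation used here.

# Word-level sketch: a repeated word is replaced by /n,
# a reference to the n-th word of the text.
def encode_words(text):
    words, seen, out = text.split(), {}, []
    for i, w in enumerate(words, start=1):
        if w in seen:
            out.append("/%d" % seen[w])     # reference to first occurrence
        else:
            seen[w] = i
            out.append(w)
    return " ".join(out)

print(encode_words("AUCH EIN KLEINER BEITRAG IST EIN BEITRAG"))
# -> AUCH EIN KLEINER BEITRAG IST /2 /4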

Dictionary method

Related to this is token-based compression: frequently recurring keywords are replaced by abbreviations, so-called tokens.

Source text: Print "Hallo"; Print "Hier"
Coded text:  3F "Hallo"; 3F "Hier"

For the assignment of tokens to the actual words, either an external dictionary must be available or the dictionary must be included in the compressed file.
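
A minimal sketch of such a token substitution in Python, assuming a hypothetical token table that maps the keyword Print to 3F as in the example above:

# Sketch of token-based compression with an external dictionary.
# The token table below is hypothetical and mirrors the example above.
tokens = {"Print": "3F"}

def tokenize(source, table):
    for word, token in table.items():
        source = source.replace(word, token)
    return source

def detokenize(coded, table):
    for word, token in table.items():
        coded = coded.replace(token, word)
    return coded

code = 'Print "Hallo"; Print "Hier"'
packed = tokenize(code, tokens)
print(packed)                          # 3F "Hallo"; 3F "Hier"
assert detokenize(packed, tokens) == code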

Run-length encoding (RLE)

Identical text components that follow one another are saved only once, together with the number of repetitions. Here "10 Grad," ("10 degrees,") is repeated three times:

Source text: In den letzten Tagen betrug die Temperatur 10 Grad, 10 Grad, 10 Grad, und dann 14 Grad.
Coded text:  In den letzten Tagen betrug die Temperatur/3/ 10 Grad,/ und dann 14 Grad.

The Burrows-Wheeler transformation is a reversible operation which transforms a given text in such a way that the same letters appear one after the other as often as possible. The data can then be compressed with RLE.
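
A minimal character-level run-length encoder can be sketched in a few lines of Python; real formats store the counts more compactly and only for runs that actually pay off.

# Minimal character-level run-length encoding (RLE) sketch.
from itertools import groupby

def rle_encode(s):
    # "AAAABBC" -> [(4, "A"), (2, "B"), (1, "C")]
    return [(len(list(group)), ch) for ch, group in groupby(s)]

def rle_decode(pairs):
    return "".join(ch * count for count, ch in pairs)

runs = rle_encode("AAAABBC")
print(runs)                            # [(4, 'A'), (2, 'B'), (1, 'C')]
assert rle_decode(runs) == "AAAABBC"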

Entropy coding

A further family of methods is so-called entropy coding, in which frequently occurring symbols are given short codes and rare symbols longer ones.

The well-known Morse code works on a similar principle and serves as a good example: frequent letters in the English language (e.g. E = .) are encoded with short codes, rare letters with long codes (e.g. Q = _ _ . _).

As an example, take a source text with a length of 66 characters (data volume 462 bits at 7 bits per character, see ASCII):

WENN HINTER FLIEGEN FLIEGEN FLIEGEN, FLIEGEN FLIEGEN FLIEGEN NACH.

A very simple but not very efficient entropy coding consists of sorting all parts of a message (see table; "_" stands for the space) according to their frequency and numbering them using binary numbers:

Text part   …is replaced by…
_FLIEGEN    1
WENN_       10
_NACH.      11
HINTER      100
,           101

The text compressed with this dictionary is

10 100 1 1 1 101 1 1 1 11

and requires 50 bits in this coding, because the result consists of three different characters (0, 1 and the separator " "), i.e. 2 bits per character. A Huffman code, i.e. the following dictionary,

Text part   …is replaced by…
_FLIEGEN    1
WENN_       011
_NACH.      010
HINTER      001
,           000

is more efficient, because it leads directly to a binary result with a length of 18 bits:

011001111000111010

In both cases, however, the dictionary must also be saved in the compressed file - otherwise the source text cannot be reconstructed.
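
The calculation can be reproduced with a minimal Huffman coder; the Python sketch below (the helper huffman_code and the token list are illustrative, not part of any standard library) builds a prefix code for the five text parts and arrives at the same total length of 18 bits, although the individual code words may differ from the table, since several equally short Huffman codes exist.

# Minimal Huffman coding sketch for the five text parts used above.
import heapq
from collections import Counter

def huffman_code(frequencies):
    # Heap entries are (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

parts = (["WENN_", "HINTER"] + ["_FLIEGEN"] * 3 + [","]
         + ["_FLIEGEN"] * 3 + ["_NACH."])
code = huffman_code(Counter(parts))
encoded = "".join(code[p] for p in parts)
print(code)
print(encoded, len(encoded))    # 18 bits in total, as in the table above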

Program files

In the case of program files, it is critical that they return exactly to their original state after decompression; otherwise error-free or correct execution would be unlikely. Compressed program files are usually themselves executable files. They consist of a routine that decompresses the program code and then executes it. As a result, the compression of the program is completely "transparent" to the user (the user does not notice it).

Application examples are UPX and Upack .

Lossy compression

In lossy compression, irrelevant information is removed, which is also referred to as irrelevance reduction . Part of the information from the original data is lost, so that the original can no longer be reconstructed from the compressed data.

What is needed is a model that decides which portion of the information is dispensable for the recipient. Lossy compression is mainly used for transmitting images, video and audio. Human perception serves as the model there. A popular example is the MP3 audio format, which removes frequency patterns that humans cannot hear, or can hear only poorly.

The theoretical basis is the rate distortion theory . It describes the minimum data transmission rate required to transmit information with a certain quality.

Pictures, videos and sound recordings

In the case of highly compressed images in JPEG format, 8 × 8 pixel squares emerge as compression artifacts (above: original size; below: an enlarged section)
Comparison of the compression artifacts in JPEG format with the lossless PNG format

Sound, images and film are areas of application for lossy compression. Otherwise the often enormous amounts of data would be very difficult to handle. The recording devices already limit the data volume. The reduction of the stored data is based on the physiological perceptual properties of humans. Compression by algorithms typically makes use of converting the waveforms of sampled signals into a frequency representation.

In human acoustic perception, frequencies above approx. 20 kHz are no longer perceived and can already be cut off in the recording system. Quiet secondary tones in a sound mixture are also difficult to perceive if very loud tones occur at exactly the same time, so these inaudible frequency components can be removed by the data compression system (see psychoacoustics) without being perceived as annoying by the listener. When digitized acoustic events (music, speech, noises) are reduced to values of around 192 kbit/s (as with many Internet downloads), people notice little or no difference in quality compared with the uncompressed source material (such as a CD).
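
As a crude illustration of cutting off inaudible frequency components, the following Python/NumPy sketch removes everything above 20 kHz from a synthetic signal; real audio coders such as MP3 use filter banks and psychoacoustic masking models rather than such a hard cut, and the sample rate and test tones here are arbitrary.

# Crude sketch: discard all frequency components above 20 kHz.
import numpy as np

sample_rate = 96_000                        # Hz, an assumed high-rate recording
t = np.arange(sample_rate) / sample_rate    # one second of signal
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 30_000 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
spectrum[freqs > 20_000] = 0                # remove inaudible components
filtered = np.fft.irfft(spectrum, n=len(signal))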

In human visual perception, colors are resolved less sharply than changes in brightness, from which the YUV 4:2:2 subsampling already known from analog color television is derived. Edges, in contrast, are more significant, and there is a biological contrast enhancement (Mach bands). With moderate low-pass filtering for color reduction, for example with the JPEG algorithm based on the DCT transformation or the newer JPEG 2000 algorithm based on the wavelet transformation, the amount of data can usually be reduced to 10% or less of the original amount without significant loss of quality.
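
The principle can be illustrated with a small Python/NumPy sketch on a single 8 × 8 block, using a hand-written orthonormal DCT-II instead of a real JPEG implementation; the smooth test block and the quantization step size of 50 are arbitrary illustrative choices.

# JPEG-style sketch on one 8x8 block: DCT, coarse quantization, many zeros.
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis as an n x n matrix.
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1 / n)
    return m

block = np.tile(np.linspace(0, 255, 8), (8, 1))   # a smooth 8x8 test block
D = dct_matrix()
coeffs = D @ (block - 128) @ D.T                  # 2-D DCT of the centred block
quantized = np.round(coeffs / 50)                 # coarse uniform quantization
print(int(np.count_nonzero(quantized)), "of 64 coefficients remain nonzero")
reconstructed = D.T @ (quantized * 50) @ D + 128  # inverse transform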

Moving images (films) consist of successive individual frames. The first approach was to compress each frame individually using the JPEG algorithm; the resulting format is Motion JPEG (which corresponds to MPEG-1 if it contains only I-frames). The much higher compression rates achieved today are only possible because the similarity of neighboring frames is taken into account during coding. To do this, each frame is broken down into smaller blocks (typical sizes are between 4 × 4 and 16 × 16 pixels), and similar blocks are searched for in frames that have already been transmitted and used as templates. The saving results from the fact that only the differences between the otherwise similar blocks have to be transferred instead of the entire image content. In addition, the changes from the previous to the current frame are used to deduce in which direction and how far the image content has shifted; only a single displacement vector is then stored for the area in question.
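
The block-matching idea can be sketched in Python with NumPy as follows; the exhaustive search window, the 8 × 8 block size, the sum-of-absolute-differences cost and the synthetic shifted frame are illustrative simplifications of what real video encoders do.

# Minimal block-matching sketch: find where an 8x8 block of the current
# frame came from in the previous frame, keep only vector + residual.
import numpy as np

def best_match(prev, block, top, left, search=4):
    best, best_vec = None, (0, 0)
    h, w = block.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= prev.shape[0] - h and 0 <= x <= prev.shape[1] - w:
                candidate = prev[y:y + h, x:x + w]
                cost = np.sum(np.abs(candidate.astype(int) - block.astype(int)))
                if best is None or cost < best:
                    best, best_vec = cost, (dy, dx)
    return best_vec

rng = np.random.default_rng(0)
previous = rng.integers(0, 256, (64, 64), dtype=np.uint8)
current = np.roll(previous, shift=(2, 3), axis=(0, 1))   # scene shifted by (2, 3)

block = current[16:24, 16:24]
print(best_match(previous, block, 16, 16))
# -> (-2, -3): the matching block lies 2 rows up and 3 columns to the left
#    in the previous frame, i.e. the content moved down and to the right.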

Compression artifacts

Compression artifacts are signal interferences caused by lossy compression.

Application in communications engineering

Use of source, channel and line coding to transmit a signal

When transmitting data, the amount of data to be transmitted is often reduced by compression. In such a case one speaks of source coding. Source coding is often used together with channel coding and line coding, but should not be confused with them: while source coding removes superfluous (redundant) information from a data source, channel coding has the task of detecting and correcting transmission or storage errors during data transmission. Line coding, in turn, adapts the spectrum of the signal to the requirements of the transmission channel.

Timeline of compression algorithms

Centuries-old shorthand can be regarded as data compression that gives handwriting the highest possible data rate

1833–1865 Development of the Morse code, which translates frequent letters into short characters and rare letters into longer ones, foreshadowing the idea of entropy coding

1883 David Forsyth, chess player and journalist, publishes a method with which the position of chess pieces is recorded in a space-saving way using run-length coding → Forsyth-Edwards notation

1949 Information Theory , Claude Shannon

1949 Shannon-Fano coding

1952 Huffman coding , static

1964 Concept of Kolmogorov Complexity

1975 Integer coding scheme, Elias

1977 Lempel-Ziv method LZ77

1978 Lempel-Ziv procedure LZ78

1979 Range coding (an implementation of arithmetic coding)

1982 Lempel-Ziv-Storer-Szymanski (LZSS)

1984 Lempel-Ziv-Welch algorithm (LZW)

1985 Apostolico, Fraenkel, Fibonacci coding

1986 Move to front (Bentley et al., Ryabko)

1991 Reduced Offset Lempel Ziv (ROLZ, also LZRW4, Lempel Ziv Ross Williams)

1994 Burrows-Wheeler transformation ( bzip2 )

1995 zlib , free standard library for Deflate

1996 Lempel-Ziv-Oberhumer algorithm (LZO), very fast compression

1997 Sequitur

1998 Lempel-Ziv-Markov algorithm (LZMA)

2006 Hutter Prize for best data compression

2009 PAQ , highest compression rates at the expense of very long runtimes; Use of a neural network (now ZPAQ )

2011 Snappy , fast coder from Google

2011 LZ4 , very fast encoder

2013 Zopfli, improved Deflate encoder

2015 Brotli , strong compression

Well-known methods of source coding

The methods are grouped here by medium (photos, audio, video) and by whether they are lossy, lossless, or support both:

  • Photos: JPEG (lossy); DjVu, JPEG 2000, PGF, TIFF, WebP (both possible); GIF, PNG, TGA (lossless)
  • Audio: AAC (MPEG), ATRAC, Dolby Digital, DTS, G.729, MP2 (MPEG), MP3 (MPEG), Musepack, Vorbis (Ogg), WMA (lossy); AIFF, WavPack (both possible); ALS (MPEG), Apple Lossless, FLAC, LA, Monkey's Audio (lossless)
  • Video: MJPEG, MPEG-1, MPEG-2, MPEG-4 (see H.264, Xvid, DivX), WMV (lossy); HuffYUV (lossless)

Data transfer

  • MNP-1 to MNP-10 (Microcom Networking Protocol): error-correction and data-compression protocols from Microcom Inc. for modems, a long-standing standard, later improved by:
  • V.42bis, a data compression protocol of the ITU-T

Biology

Sensory perceptions are filtered, which is also a kind of compression, more precisely a lossy compression, since only the currently relevant information is perceived. Missing parts are subconsciously filled in when necessary. For example, human eyes can only see sharply in a small area (fovea centralis); outside this narrow field of vision, missing information is unconsciously replaced by patterns. The human eye can also perceive differences in brightness much better than differences in hue; the YCbCr color model used in JPEG images exploits this fact and stores the color values with much less precision.

Even when listening , weak or missing signals are subconsciously replaced, which algorithms such as MPEG ( MP3 ) or Vorbis make use of.

See also

Web links

Wikibooks: Book on data compression  - learning and teaching materials
Wiktionary: data compression  - explanations of meanings, word origins, synonyms, translations
