Delta coding

from Wikipedia, the free encyclopedia

Delta coding or differential storage describes techniques for the transmission and storage of data in the form of changes and not as entire files. The changes that are saved in discrete files are called "deltas" or "diffs". As techniques for data compression , they reduce the memory and bandwidth requirements when processing correlated data such as sequential data (= data that is available in several versions ).

function

The starting point is a data set that exists in two or more versions. An example of this are the source texts of programs, which are saved in every version of their creation. With conventional technology, all of the data must be saved for each version.

Usually, two successive versions differ only slightly. Often only single typing errors are exchanged in huge amounts of data. The idea behind delta coding is to save only the changes, not both versions as a whole.

example

There are two versions of a text, both of which should be saved:

  • This is an example sentence.
  • This is another example sentence.

In order to save the information content of the second sentence, it does not have to be saved in full. It is sufficient if the first sentence remains stored and for the second sentence only the information "insert the word of another person after the third word " is stored. This only saves the difference to the first version, which may result in considerable data savings.

implementation

There are two methods for delta storage:

  • The original text remains in its original form, only the changes to the next new version are recorded.
  • The updated text is saved and the change from the previous version is recorded.

Every change is recorded in one or more delta storage records . This contains the position from the beginning of the file, then information on whether a number of bytes should be removed or which byte sequence should be inserted. There can be any number of such records between two versions. Typically these records are appended to the file along with information about which version they belong to. This procedure can be used as often as required and thus provides a complete history of the versions of a file.

Make the version you want

Starting from the basic version, the changes are made one after the other in the correct order of the versions in order to obtain the desired version.

Usually the current version is the most frequently used. Therefore it usually makes sense to use the second method of saving (full display of the current version and saving of the changes to the previous versions). It has also proven itself with many version control systems, such as: B. RCS and CVS enforced. In the case of branches in the version history, the second variant is also used there. You go back from the current version to the branch point, then forwards to the desired version of the branch.

use cases

Two use cases of delta coding are backup systems (e.g. rsync ) and software version management tools (e.g. Git).

Backup systems

Many backup tools use delta encoding. In addition to reducing the required storage space, it allows earlier versions of files to be created. Without delta coding, the entire file would have to be saved for each backup run, which would increase the storage space required and the duration of the backup.

Git

The distributed version management system Git uses delta coding in a so-called "Git Repack" operation. Objects in the repository that have not yet been delta-compressed (so-called "loose objects") are compared with a heuristically selected subset of all other objects. The common data and deltas are put together in a so-called “pack file” and then compressed using conventional methods. In normal use cases where files are changed incrementally from commit to commit, this procedure results in substantial memory savings.

fitness

The nature of the data is critical to the effectiveness of any data compression algorithm . Delta coding is particularly suitable for data that has slight and constant differences from version to version. In this case, delta coding significantly reduces data redundancy . An example of this is a large text document in which only a few sentences are changed.

For an unsorted data set, the degree of compression can be low or nonexistent. Typical binary files like executables have too many changes from version to version that differential storage does not have a compression effect. In fact, the data can even be extended with it.

In the case of originally compressed files, decompression can simplify subsequent delta encoding.

In video compression , delta coding is used with the difference-coded P and B frames and here contributes a very high proportion to the efficiency of the respective method.

See also