Deduplication (also data deduplication or data de-duplication) is, in information technology, a process in which redundant data is identified (duplicate detection) and eliminated before it is written to a non-volatile storage medium. Like other compression methods, the process reduces the amount of data that is sent from a transmitter to a receiver. The efficiency of deduplication algorithms is almost impossible to predict, because it always depends on the structure of the data and its rate of change. Deduplication can be a very efficient way of reducing data volumes in which patterns can be recognized (unencrypted data).
The primary area of application of deduplication is data protection (backup), where in practice it usually achieves greater data reduction than other methods. In principle, the process is suitable for any area of application in which data is copied repeatedly.
Deduplication systems divide files into blocks of the same size (usually a power of two) and calculate a checksum for each block. This also distinguishes deduplication from Single Instance Storage (SIS), which only eliminates identical files as a whole (see also content-addressed storage systems, CAS).
All checksums are then saved together with a reference to the corresponding file and the position within the file. When a new file is added, its content is likewise divided into blocks and their checksums are calculated. The system then checks whether each checksum already exists. Comparing checksums is much faster than comparing file contents directly with one another. An identical checksum indicates that an identical data block may have been found, but the contents must still be compared, because the match could also be a collision.
If an identical data block is found, one of the blocks is removed and only a reference to the other data block is saved instead. This reference takes up less space than the block itself.
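The steps described so far (fixed-size blocks, a checksum per block, a content comparison to rule out collisions, and a reference instead of a second copy) can be sketched roughly as follows. This is a minimal in-memory illustration, not a real storage system; the class name `BlockStore`, the block size of 4096 bytes, and the choice of SHA-256 as the checksum are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size (a power of two)


class BlockStore:
    """Toy in-memory block store; all names here are illustrative."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = []   # contents of the unique blocks
        self.index = {}    # checksum -> ids of blocks with that checksum

    def _store_block(self, block):
        checksum = hashlib.sha256(block).digest()
        # A matching checksum is only a hint: compare the contents
        # to rule out a collision before reusing a block.
        for block_id in self.index.get(checksum, []):
            if self.blocks[block_id] == block:
                return block_id  # duplicate: only a reference is kept
        self.blocks.append(block)
        block_id = len(self.blocks) - 1
        self.index.setdefault(checksum, []).append(block_id)
        return block_id

    def add_file(self, data):
        """Split data into fixed-size blocks and return their references."""
        return [
            self._store_block(data[i:i + self.block_size])
            for i in range(0, len(data), self.block_size)
        ]

    def read_file(self, refs):
        """Reassemble a file from its list of block references."""
        return b"".join(self.blocks[ref] for ref in refs)
```

With a block size of 4, adding `b"AAAABBBBAAAA"` and then `b"BBBBCCCC"` stores only three unique blocks, and each file is kept as a short list of references into them.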
There are two methods for selecting which block is kept. With "reverse referencing", the first occurrence of a shared block is saved and all later identical blocks receive a reference to it. "Forward referencing" always stores the most recent occurrence of a shared data block and makes the earlier occurrences reference it. The dispute between the two methods is about whether data should be faster to save or faster to recover. A further distinction, "inband" versus "outband", concerns whether the data stream is analyzed "on the fly", i.e. during operation, or only after it has been saved at the destination. In the first case only one data stream may exist; in the second, the data can be examined in parallel using several data streams.
When backing up data from hard disks to tape media, the proportion of new or changed data relative to unchanged data between two full backups is usually fairly small. With classic data backup, however, two full backups still require at least twice the storage capacity of the original data on tape. Deduplication recognizes the identical data components. For this purpose, unique segments are recorded in a list, and when such a segment occurs again, the time and place in the data stream are noted so that the original data can ultimately be restored.
However, these are no longer independent full backups. The loss of one version therefore results in irretrievable data loss. Like incremental backups, deduplication sacrifices data security in favor of lower storage requirements.
The aim is to split the data into pieces in such a way that as many identical data blocks as possible arise, which can then be deduplicated. This splitting process is called chunking (a chunk being a 'piece' or 'block'). The process of uniquely identifying blocks is called fingerprinting and can, for example, be carried out using a cryptographic hash function.
The more finely changes to a file can be detected, the less redundant data has to be backed up. However, this enlarges the index, i.e. the construction plan that describes how and from which components the file is reassembled when it is retrieved. This trade-off must be taken into account when choosing the block size for chunking.
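The difference between the two referencing methods can be illustrated with a small sketch. It is only a toy model: the functions `deduplicate` and `resolve`, the slot representation, and the use of the block content itself as the lookup key (a real system would use a checksum) are assumptions for the example.

```python
def deduplicate(blocks, forward=False):
    """Return one slot per block: ("data", content) or ("ref", slot index)."""
    slots = []
    holder = {}  # block content -> index of the slot that holds the data
    for i, block in enumerate(blocks):
        if block not in holder:
            slots.append(("data", block))
            holder[block] = i
        elif forward:
            # Forward referencing: the newest copy keeps the data and
            # the previous holder is rewritten into a reference to it.
            slots[holder[block]] = ("ref", i)
            slots.append(("data", block))
            holder[block] = i
        else:
            # Reverse referencing: the first copy keeps the data.
            slots.append(("ref", holder[block]))
    return slots


def resolve(slots, i):
    """Follow references from slot i until the actual data is reached."""
    kind, value = slots[i]
    while kind == "ref":
        kind, value = slots[value]
    return value
```

In this model reverse referencing never touches slots that have already been written, whereas forward referencing rewrites an older slot for every duplicate but keeps the most recent data directly readable. That is one plausible reading of the save-versus-recover trade-off mentioned above.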
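The storage argument can be made concrete with a short calculation. The following sketch rests on simplifying assumptions: the function name `backup_storage` is made up, every full backup is assumed to differ from the previous one by the same fraction, and the space needed for the list of segments is ignored.

```python
def backup_storage(original_size, changed_fraction, full_backups=2):
    """Estimate the storage needed for a series of full backups.

    Returns (classic, deduplicated) in the same unit as original_size.
    Assumption: each backup after the first adds only its changed data.
    """
    classic = original_size * full_backups
    deduplicated = original_size + original_size * changed_fraction * (full_backups - 1)
    return classic, deduplicated
```

For an assumed 1000 GB of original data of which 5 % changes between two full backups, the classic approach needs 2000 GB, while the deduplicated one needs roughly 1050 GB.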
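Chunking and fingerprinting can be sketched as two small functions. The example below uses the simplest possible strategy, fixed-size chunks, and SHA-256 as the cryptographic hash function; both choices, like the function names, are assumptions for illustration only.

```python
import hashlib


def chunk(data, chunk_size):
    """Chunking: split data into pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprint(piece):
    """Fingerprinting: identify a chunk by its SHA-256 hash."""
    return hashlib.sha256(piece).hexdigest()


def new_chunks(old, new, chunk_size):
    """Return the chunks of `new` whose fingerprints do not occur in `old`."""
    known = {fingerprint(c) for c in chunk(old, chunk_size)}
    return [c for c in chunk(new, chunk_size) if fingerprint(c) not in known]
```

If only one chunk of a file changes, `new_chunks` returns just that chunk, and every other chunk can be replaced by a reference to data that is already stored.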
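A rough calculation shows how quickly the index grows as the blocks shrink. In this sketch the function names and the entry size of 40 bytes (for example a 32-byte hash plus an 8-byte location) are assumptions; real systems will differ.

```python
def index_entries(data_size, block_size):
    """Number of index entries needed for data_size bytes (rounded up)."""
    return -(-data_size // block_size)  # ceiling division


def index_bytes(data_size, block_size, entry_size=40):
    """Approximate index size, assuming entry_size bytes per block."""
    return index_entries(data_size, block_size) * entry_size


TIB = 2 ** 40
small_blocks = index_bytes(TIB, 4 * 1024)    # 4 KiB blocks
large_blocks = index_bytes(TIB, 64 * 1024)   # 64 KiB blocks
```

Under these assumptions, 1 TiB of data needs an index of 10 GiB with 4 KiB blocks but only 640 MiB with 64 KiB blocks, while the larger blocks detect changes sixteen times more coarsely.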