Duplicate detection

Duplicate detection or object identification (also known as record linkage) comprises various automatic procedures for identifying records that represent the same object in the real world. This is necessary, for example, when merging several data sources (deduplication) or when cleaning data.

Duplicates can arise, for example, from input and transmission errors, from different spellings and abbreviations, or from differing data schemas. For example, addresses from different sources can be combined in an address database, in which one and the same address may appear several times with variations. Duplicate detection should then find these duplicates and identify the actual addresses as objects.

There are two types of duplicates: identical duplicates, in which all values are identical, and non-identical duplicates, in which one or more values differ. Detection and cleanup are trivial in the first case: the surplus duplicates can simply be deleted without loss of information. The second case is harder, because the duplicates cannot be found with a simple is-equal comparison; heuristics must be used instead. Moreover, the surplus records cannot simply be deleted; they must first be consolidated and their values merged.
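
A minimal sketch of the first, trivial case in Python, with records represented as tuples purely for illustration: identical duplicates disappear under a plain is-equal comparison, while a non-identical duplicate survives and requires a heuristic comparison.

```python
# Identical duplicates can be removed with a plain equality check,
# for example by collecting the records in a set.
records = [
    ("Max Muller", "Berlin"),
    ("Max Muller", "Berlin"),   # identical duplicate: removed for free
    ("Max Mueller", "Berlin"),  # non-identical duplicate: survives
]
unique = set(records)
# {('Max Muller', 'Berlin'), ('Max Mueller', 'Berlin')}
# The spelling variant is still present; detecting it requires a
# similarity heuristic, as described in the following sections.
```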

The process of detecting and consolidating duplicates

The process of detecting and consolidating duplicates can be carried out in the following four steps (Apel, 2009, p. 164); a minimal sketch follows the list:

  1. Preprocessing the data
  2. Partitioning of the data
  3. Detection of duplicates and
  4. Consolidation into one data set.
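
The following Python sketch walks through all four steps on illustrative address records; the field names, the blocking key (the city) and the similarity threshold are assumptions made for this example, not part of the method itself.

```python
from itertools import combinations
from difflib import SequenceMatcher

records = [
    {"name": "Max  Mueller", "city": "Berlin"},
    {"name": "Max Muller", "city": "Berlin"},
    {"name": "Eva Schmidt", "city": "Hamburg"},
]

# 1. Preprocessing: normalize case and whitespace.
def preprocess(rec):
    return {k: " ".join(v.lower().split()) for k, v in rec.items()}

cleaned = [preprocess(r) for r in records]

# 2. Partitioning: group records by a blocking key (here the city),
#    so that only records within the same partition are compared.
partitions = {}
for rec in cleaned:
    partitions.setdefault(rec["city"], []).append(rec)

# 3. Detection: compare candidate pairs with a string similarity measure.
def is_duplicate(a, b, threshold=0.8):
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

duplicate_pairs = [
    (a, b)
    for block in partitions.values()
    for a, b in combinations(block, 2)
    if is_duplicate(a, b)
]

# 4. Consolidation: merge each duplicate pair into one record, here
#    naively by keeping the longer (presumably more complete) value.
def consolidate(a, b):
    return {k: max(a[k], b[k], key=len) for k in a}

merged = [consolidate(a, b) for a, b in duplicate_pairs]
print(merged)  # [{'name': 'max mueller', 'city': 'berlin'}]
```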

Various similarity measures are used to identify duplicates, for example the Levenshtein distance or the typewriter distance. Because comparing every record with every other record is usually too expensive, methods such as the sorted neighborhood method are used, in which only potentially similar records are checked to determine whether they are duplicates.
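
As a sketch of both ideas, the following Python code implements the Levenshtein distance and a simplified sorted neighborhood pass; the window size and distance threshold are arbitrary example values.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sorted_neighborhood(values, window=3, max_distance=2):
    """Sort the values, then compare each one only to its neighbors
    inside a sliding window instead of to every other value."""
    ordered = sorted(values)
    pairs = []
    for i, v in enumerate(ordered):
        for w in ordered[i + 1 : i + window]:
            if levenshtein(v, w) <= max_distance:
                pairs.append((v, w))
    return pairs

# Finds the near-identical spellings without comparing all pairs:
print(sorted_neighborhood(["Max Muller", "Max Mueller", "M. Muller"]))
```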

There are phonetic algorithms that assign a sequence of characters, the phonetic code, to words according to their pronunciation in order to implement a similarity search; examples are Soundex and Cologne Phonetics.
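
The following is a sketch of the classic American Soundex code in Python (Cologne Phonetics works analogously with a different code table); it follows the common textbook formulation and assumes a non-empty alphabetic input word.

```python
def soundex(word: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = [word[0].upper()]        # the first letter is kept as-is
    prev = codes.get(word[0], "")     # but its code is consumed
    for ch in word[1:]:
        if ch in "hw":                # h and w do not separate equal codes
            continue
        code = codes.get(ch, "")      # vowels get no code and reset prev
        if code and code != prev:
            result.append(code)
        prev = code
    return "".join(result[:4]).ljust(4, "0")

print(soundex("Muller"), soundex("Mueller"))  # M460 M460: same code
```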

Examples

The following entries from a list of names may be duplicates:

Max Muller
Max Mueller
M. Muller
Max Muller
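
A brief sketch of how a similarity measure separates these cases, here using the similarity ratio from Python's standard library as one illustrative choice: the identical pair scores 1.0, the spelling variants score high but below 1.0, and a threshold decides what counts as a duplicate.

```python
from difflib import SequenceMatcher

names = ["Max Muller", "Max Mueller", "M. Muller", "Max Muller"]
for i, a in enumerate(names):
    for b in names[i + 1:]:
        ratio = SequenceMatcher(None, a, b).ratio()
        print(f"{a:12} {b:12} {ratio:.2f}")
# The identical pair 'Max Muller' / 'Max Muller' scores 1.00; the
# spelling variants score below 1.00 but above a typical threshold.
```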

In a library, duplicates can occur when multiple library catalogs are merged.

Literature

  • Detlef Apel, Wolfgang Behme, Rüdiger Eberlein, Christian Merighi: Successfully Controlling Data Quality - Practical Solutions for BI Projects. Hanser Fachbuch, 2009, ISBN 978-3-446-42056-4.
  • Felix Naumann (Hasso Plattner Institute, HPI), Melanie Herschel (University of Tübingen): An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2010, doi:10.2200/S00262ED1V01Y201003DTM003.
  • Felix Naumann (HPI): Data Profiling and Data Cleansing - Similarity Measures (PDF). Lecture slides from June 11, 2013.
  • Jürgen Nemitz: Data networking in a historical research project. In: EDV-Tage Theuern, Theuern 2000.