Sorted neighborhood

from Wikipedia, the free encyclopedia

Assorted neighborhood (engl. Sorted neighborhood ) is a method for duplicate detection . The core idea is to sort the data records in which duplicates are to be found so that potential duplicates are as close together as possible and therefore only data records that are close together need to be compared with each other in a neighborhood. As a result of the procedure, groups (clusters) of presumed duplicates are obtained.

Procedure

The data records to be checked are given in a table with several attributes. Sorted neighborhood works in three phases:

Key generation A key is a character string that is formed from (parts of) relevant attributes. The effectiveness of the algorithm is largely dependent on a clever choice of the key. The key is only a virtual one and, in contrast to a hash value, is not unique. It is only used to sort the tuples.

Sorting

  • the tuples are sorted lexicographically according to the generated key
  • The aim of the sorting is that tuples that represent duplicates of each other are grouped as close to each other as possible
  • the sorting algorithm can be chosen as desired (e.g. merge or alpha sort for secondary storage )

merger

  • a window of a fixed size is moved line by line over the sorted tuples (sliding window)
  • be the number of tuples , then applies to the size of the window
  • the key was only used to sort the tuples; to find duplicates, the attributes of the tuples are compared in pairs within the window
  • Complex rules are used to compare the individual attributes of the tuple pairs, which determine when two tuples are duplicates (in addition to the Levenshtein distance , other comparisons such as the length of character strings or numerical values ​​are made)

effort

Let the number of tuples and the line size of the window be, then the effort is according to the 3 phases key generation , sorting and merging. If then is the overall complexity , if not, it is what becomes for a window size of N (ie so that maximum duplicates are found) .

improvement

One improvement is to create a transitive shell over the duplicates found. This allows the window size and thus the number of necessary comparisons to be reduced.

Let us denote that and are duplicates, then we can deduce from , with , and , with , that even if there is no window in which and are common.

Multipass technology

Depending on the choice of key, it is possible that duplicates are very far apart. The trivial option would be to enlarge the window. However, this increases the comparison effort enormously.

A better option is the multipass procedure. Here, a sorted neighborhood is carried out several times, with different keys and relatively small ones . Then the transitive envelope is formed over all results.

Elaborate and sorted neighborhoods after Monge and Elkan

The procedure in brief:

  • to achieve domain independence, tuples are interpreted as strings; the procedure is then carried out twice, once the tuple is interpreted forwards as a character string and once backwards
  • Groups of duplicates are formed, each represented by a representative, and comparisons are made with only these representatives
  • a graph structure is used that includes the following:
    • one node per tuple
    • an edge from tuple x to tuple y if tuples x and y are duplicates
    • Networked duplicates form a graph, referred to here as connected components
    • there are two operations on the graph:
      • Union(x,y) connects the two tuples x and y and thus also the two connected components that contain x and y to form a newer, larger connected component (the old individual components are not saved) and a representative is chosen for this connected component
      • Find(x) returns the representative for the connected component that contains the tuple x
  • The algorithm uses a priority queue (priority queue); this contains several sets, each of which originates from one or more tuples from a single group of (duplicate) tuples; Each of these sets contains the tuples that represent a group of (duplicate) tuples
  • the tuples are read in sequentially and a check is made for each whether it is already a member of one of the groups that are represented in the priority queue (using Find(Tupel)); if yes, the next tuple is read, if no, then it is compared with the representatives of the groups in the priority queue; if it is recognized as a duplicate, then an is Union(Tupel,Repräsentant)executed and the related components are connected with it, if not, the tuple is stored as a set of ones in the priority queue
  • the quantity in the priority queue that represents the group that contains the last group member found has the highest priority, in descending order below
  • since the tuples have been sorted beforehand, tuples belonging to the same group will i. A. found one after the other, which is why a small number (e.g. 4) of quantities in the priority queue is sufficient

literature

  • Alvaro Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the Workshop on Research Issues on Data Mining and Knowledge Discovery. 1997. PDF
  • Sergio Jimenez, Claudia Becerra, Alexander Gelbukh, Fabio Gonzalez: Generalized Monge-Elkan Method for Approximate Text String Comparison . 2009. PDF . doi : 10.1007 / 978-3-642-00382-0_45