Gestalt Pattern Matching

from Wikipedia, the free encyclopedia

Gestalt Pattern Matching , also Ratcliff / Obershelp Pattern Recognition , is a string matching algorithm for determining the similarity of two character strings . It was developed in 1983 by John W. Ratcliff and John A. Obershelp and in July 1988 in the Dr. Dobb's Journal published.

algorithm

The similarity of two character strings and is determined by dividing twice the number of matching characters by the total number of all characters in both character strings. Matching characters are considered to be those in the longest contiguous matching subsequence plus, recursively, the number of matching characters in the mismatched areas on either side of this longest common subsequence:

where the degree of similarity can assume a value between zero and one:

The value 1 stands for complete agreement, the value 0, however, for no agreement, there is then not even a common letter.

example

S 1 W. I. K I. M. E. D. I. A.
S 2 W. I. K I. M. A. N I. A.

The longest matching sub-sequence is WIKIM(dark gray) with 5 characters. There is no further subsequence to the left of this. The right non-matching sub-sequence EDIAor ANIAhave a matching sub-sequence IA(light gray) with length 2. The degree of similarity is thus determined as follows:

properties

complexity

The running time of the algorithm in O is worst case and mean. However, by changing the procedure, the running time can be significantly improved.

Commutative law

It can be shown that the gestalt pattern matching algorithm is not commutative :

example

For the two strings

and

results for

a measure of the substring , , , , andGESTALTTERI
a measure of the substring , , , .GESTALTTHI

Areas of application

The algorithm became the basis of the diffliblibrary in Python , which was introduced with version 2.1. Due to the unfavorable runtime behavior of the similarity measure, three methods were implemented, two of which can return an upper bound in a faster runtime. The fastest variant only compares the length of the two substrings:

,
# Drqr Implementierung in Python
def real_quick_ratio(s1, s2: str) -> float:
    """Return an upper bound on ratio() very quickly."""
    l1, l2 = len(s1), len(s2)
    length = l1 + l2

    if not length:
        return 1.0

    return 2.0 * min(l1, l2) / length

The second upper bound exposes twice the sum of all characters used that occur in, in relation to the length of both character strings. The character strings are not taken into account.

,
# Dqr Implementierung in Python
def quick_ratio(s1, s2: str) -> float:
    """Return an upper bound on ratio() relatively quickly."""
    length = len(s1) + len(s2)

    if not length:
        return 1.0

    intersect = collections.Counter(s1) & collections.Counter(s2)
    matches = sum(intersect.values())
    return 2.0 * matches / length

Trivially, the following apply:

and
.

complexity

The runtime of this particular Python implementation is worst case and best case.

supporting documents

  1. a b c d e difflib - Helpers for computing deltas in the Python documentation
  2. a b c National Institute of Standards and Technology Ratcliff / Obershelp pattern recognition
  3. Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff / Obershelp algorithms in spell check , May 2014 (PDF)
  4. How does Python's SequenceMatcher work? on stackoverflow.com
  5. Borrowed from Python 3.7.0, difflib.py lines 38-41 and 676-686

literature

  • John W. Ratcliff and David Metzener : Pattern Matching: The Gestalt Approach , Dr. Dobb's Journal, Ropes 46, July 1988

See also