Gestalt Pattern Matching

Gestalt Pattern Matching, also known as Ratcliff/Obershelp Pattern Recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988.

Algorithm

The similarity of two strings $S_1$ and $S_2$ is determined by dividing twice the number of matching characters $K_m$ by the total number of characters in both strings. The matching characters are those in the longest common substring plus, recursively, the matching characters in the non-matching regions on either side of that longest common substring:

$$D_{ro} = \frac{2K_m}{|S_1| + |S_2|}$$

where the degree of similarity takes a value between zero and one:

$$0 \leq D_{ro} \leq 1$$

A value of 1 means that the two strings match completely. A value of 0 means that there is no match at all, not even a single common character.
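The recursive definition above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, ties between equally long substrings are broken in favour of the earliest position in the first string (an assumption the definition leaves open), and two empty strings are assumed to have similarity 1.

```python
def longest_common_substring(s1: str, s2: str) -> tuple[int, int, int]:
    """Return (start in s1, start in s2, length) of the longest common substring.

    Ties are resolved in favour of the earliest match in s1, then in s2.
    """
    best_i = best_j = best_len = 0
    # prev[j] holds the length of the common suffix of s1[:i-1] and s2[:j].
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = curr
    return best_i, best_j, best_len


def matching_characters(s1: str, s2: str) -> int:
    """Compute K_m: the anchor length plus the matches to its left and right."""
    i, j, n = longest_common_substring(s1, s2)
    if n == 0:
        return 0
    left = matching_characters(s1[:i], s2[:j])
    right = matching_characters(s1[i + n:], s2[j + n:])
    return n + left + right


def similarity(s1: str, s2: str) -> float:
    """Compute D_ro = 2 * K_m / (|S1| + |S2|)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumed convention for two empty strings
    return 2.0 * matching_characters(s1, s2) / total


print(similarity("WIKIMEDIA", "WIKIMANIA"))  # about 0.778
```

The quadratic table scan inside a recursive function is chosen for readability rather than speed.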

Example

S₁	W	I	K	I	M	E	D	I	A
S₂	W	I	K	I	M	A	N	I	A

The longest common substring is WIKIM, with 5 characters. There is nothing left to compare to its left. The non-matching remainders to its right, EDIA and ANIA, again share a common substring, IA, of length 2. The degree of similarity is therefore:

$$\frac{2K_m}{|S_1| + |S_2|} = \frac{2 \cdot (|\text{``WIKIM''}| + |\text{``IA''}|)}{|S_1| + |S_2|} = \frac{2 \cdot (5 + 2)}{9 + 9} = \frac{14}{18} = 0.\overline{7}$$
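The worked example can be reproduced with Python's `difflib.SequenceMatcher`, whose `ratio()` is based on this algorithm. The sketch below passes `autojunk=False` so that the library's junk heuristic cannot influence the result; it makes no difference for strings this short.

```python
from difflib import SequenceMatcher

s1, s2 = "WIKIMEDIA", "WIKIMANIA"
matcher = SequenceMatcher(None, s1, s2, autojunk=False)

# get_matching_blocks() ends with a zero-length sentinel, which is skipped here.
blocks = [s1[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
k_m = sum(len(block) for block in blocks)

print(blocks)           # ['WIKIM', 'IA']
print(k_m)              # 7
print(matcher.ratio())  # 0.7777777777777778
```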

Properties

Complexity

The running time of the algorithm is $O(n^3)$ in the worst case and $O(n^2)$ on average. However, the running time can be improved significantly by changing the way the computation is carried out.

Commutative law

It can be shown that the Gestalt Pattern Matching algorithm is not commutative. In general,

$$D_{ro}(S_1, S_2) \neq D_{ro}(S_2, S_1).$$

Example

For the two strings

$$S_1 = \text{GESTALT PATTERN MATCHING}$$

and

$$S_2 = \text{GESTALT-THEORIE}$$

$D_{ro}(S_1, S_2)$ yields a similarity of $\frac{22}{39}$, based on the matching substrings GESTALT, T, E, R and I, whereas $D_{ro}(S_2, S_1)$ yields a similarity of $\frac{20}{39}$, based on the matching substrings GESTALT, T, H and I.
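The asymmetry can be checked with `difflib.SequenceMatcher`. The sketch below uses the strings GESTALT PATTERN MATCHING and GESTALT-THEORIE (24 and 15 characters, so the denominator is 39). The helper names are hypothetical, and `autojunk=False` is passed so that only the plain matching procedure is exercised.

```python
from difflib import SequenceMatcher
from fractions import Fraction


def matched_substrings(a: str, b: str) -> list[str]:
    """List the matching substrings found when a is compared against b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Skip the zero-length sentinel at the end of get_matching_blocks().
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


def similarity_fraction(a: str, b: str) -> Fraction:
    """Return D_ro as an exact fraction 2 * K_m / (|a| + |b|)."""
    k_m = sum(len(s) for s in matched_substrings(a, b))
    return Fraction(2 * k_m, len(a) + len(b))


s1 = "GESTALT PATTERN MATCHING"
s2 = "GESTALT-THEORIE"

print(matched_substrings(s1, s2), similarity_fraction(s1, s2))
# ['GESTALT', 'T', 'E', 'R', 'I'] 22/39
print(matched_substrings(s2, s1), similarity_fraction(s2, s1))
# ['GESTALT', 'T', 'H', 'I'] 20/39
```

The difference arises because ties between equally long candidate substrings are resolved by position in the first argument, so swapping the arguments changes which single-character anchors are chosen.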

Areas of application

The algorithm became the basis of the difflib library in Python, which was introduced with version 2.1. Because of the unfavorable runtime behavior of the similarity measure, three methods were implemented, two of which return an upper bound in a shorter running time. The fastest variant only compares the lengths of the two strings:

$$D_{rqr} = \frac{2 \cdot \min(|S_1|, |S_2|)}{|S_1| + |S_2|}$$

# D_rqr implementation in Python
def real_quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() very quickly."""
    l1, l2 = len(s1), len(s2)
    length = l1 + l2

    # Two empty strings are treated as identical.
    if not length:
        return 1.0

    # At most min(l1, l2) characters can match.
    return 2.0 * min(l1, l2) / length

The second upper bound relates twice the number of characters of $S_1$ that also occur in $S_2$ to the total length of both strings. The order of the characters is not taken into account:

$$D_{qr} = \frac{2 \cdot \big| \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big|}{|S_1| + |S_2|}$$

# D_qr implementation in Python
import collections

def quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() relatively quickly."""
    length = len(s1) + len(s2)

    # Two empty strings are treated as identical.
    if not length:
        return 1.0

    # Multiset intersection: per character, the smaller of the two counts.
    intersect = collections.Counter(s1) & collections.Counter(s2)
    matches = sum(intersect.values())
    return 2.0 * matches / length

Trivially, the following hold:

$$0 \leq D_{ro} \leq D_{qr} \leq D_{rqr} \leq 1$$

and

$$0 \leq K_m \leq \big| \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big| \leq \min(|S_1|, |S_2|) \leq \frac{|S_1| + |S_2|}{2}$$
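The chain of inequalities can be observed directly with the three methods that `difflib.SequenceMatcher` provides: `ratio()`, `quick_ratio()` and `real_quick_ratio()`. The sketch below evaluates all three for a few sample pairs. The helper name `similarity_bounds` and the sample strings are chosen here for illustration only.

```python
from difflib import SequenceMatcher


def similarity_bounds(a: str, b: str) -> tuple[float, float, float]:
    """Return (D_ro, D_qr, D_rqr) for the two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio(), matcher.quick_ratio(), matcher.real_quick_ratio()


samples = [
    ("WIKIMEDIA", "WIKIMANIA"),
    ("GESTALT PATTERN MATCHING", "GESTALT-THEORIE"),
    ("abc", "cba"),
    ("abc", "xyz"),
]

for a, b in samples:
    d_ro, d_qr, d_rqr = similarity_bounds(a, b)
    # Each bound is at least as large as the exact measure.
    assert 0.0 <= d_ro <= d_qr <= d_rqr <= 1.0
    print(f"{a!r} vs {b!r}: {d_ro:.3f} <= {d_qr:.3f} <= {d_rqr:.3f}")
```

A typical use of the bounds is as a cheap filter: if `real_quick_ratio()` or `quick_ratio()` already falls below a required threshold, the more expensive `ratio()` does not need to be computed.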

Complexity

The running time of this particular Python implementation is $O(n^2)$ in the worst case and $O(n)$ in the best case.

References

difflib - Helpers for computing deltas, in the Python documentation
National Institute of Standards and Technology: Ratcliff/Obershelp pattern recognition
Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms in spell check, May 2014 (PDF)
How does Python's SequenceMatcher work? on stackoverflow.com
Code borrowed from Python 3.7.0, difflib.py, lines 38-41 and 676-686

Literature

John W. Ratcliff and David Metzener: Pattern Matching: The Gestalt Approach, Dr. Dobb's Journal, p. 46, July 1988
