Levenshtein distance

The Levenshtein distance (also editing distance ) between two character strings is the minimum number of insert, delete and replace operations to convert the first character string into the second. The distance is named after the Russian scientist Vladimir Lewenstein (Levenshtein), who introduced it in 1965. Mathematically, the Levenshtein distance is a metric in the space of symbol sequences .

For example, the Levenshtein distance between "Tier" and "Tor" is 2. A possible consequence of two operations is:

animal
Toer (replace i with o)
Goal (delete e)

In practice, the Levenshtein distance is used to determine the similarity of character strings, for example to check spelling or to detect duplicates .

algorithm

The Levenshtein article of 1965 defines the Levenshtein distance, but no algorithm is given for calculating the distance. In the literature, an algorithm for calculating the Levenshtein distance using the dynamic programming method is called the Levenshtein algorithm.

The algorithm is the following matrix - recurrences specified, where and denote the two input strings: ${\ displaystyle u}$ ${\ displaystyle v}$

{\ displaystyle m = | u |}

{\ displaystyle n = | v |}

{\ displaystyle D_ {0,0} = 0}

{\ displaystyle D_ {i, 0} = i, 1 \ leq i \ leq m}

{\ displaystyle D_ {0, j} = j, 1 \ leq j \ leq n}

{\ displaystyle D_ {i, j} = \ min {\ begin {cases} D_ {i-1, j-1} & + 0 \ {\ rm {if}} \ u_ {i} = v_ {j} \ \ D_ {i-1, j-1} & + 1 \ {\ rm {(replacement)}} \\ D_ {i, j-1} & + 1 \ {\ rm {(Einf {\ ddot {u}) } gung)}} \\ D_ {i-1, j} & + 1 \ {\ rm {(L {\ ddot {o}} schung)}} \ end {cases}}}

,

{\ displaystyle 1 \ leq i \ leq m, 1 \ leq j \ leq n}

After the matrix is calculated, the Levenshtein distance, i.e. H. the minimum number of edit operations in the matrix cell . In each cell there is the Levenshtein distance of the prefixes and . In the event of a further deletion or insertion, only one more character from or is used. Ie the result is saved in or . Thus, the costs for a deletion or insertion in the cell can be recurrently calculated or calculated. The case of replacement can be explained in the same way. ${\ displaystyle D}$ ${\ displaystyle D_ {m, n}}$ ${\ displaystyle D_ {i, j}}$ ${\ displaystyle u_ {0, i}}$ ${\ displaystyle v_ {0, j}}$ ${\ displaystyle u}$ ${\ displaystyle v}$ ${\ displaystyle D_ {i + 1, j}}$ ${\ displaystyle D_ {i, j + 1}}$ ${\ displaystyle D_ {i, j}}$ ${\ displaystyle D_ {i-1, j} +1}$ ${\ displaystyle D_ {i, j-1} +1}$

If not only the Levenshtein distance is to be calculated, but also the sequence of operations that led to the result, a "backtrace" must be carried out on the calculated matrix. That is, starting with , the optimization decisions are traced back recursively. In the pseudocode the backtrace algorithm is: ${\ displaystyle D_ {i, j-1}}$

string backtrace(i, j):
  if i>0 && D[i-1,j] + 1 == D[i,j]
    return backtrace(i-1, j) + " Löschung "
  if j>0 && D[i,j-1] + 1 == D[i,j]
    return backtrace(i, j-1) + " Einfügung "
  if i>0 && j>0 && D[i-1, j-1] + 1 == D[i,j]
    return backtrace(i-1, j-1) + " Ersetzung "
  if i>0 && j>0 && D[i-1, j-1]  == D[i,j]
    return backtrace(i-1, j-1) + " Gleichheit "
  return ""

To simplify the pseudo-code, only one possible backtrace is calculated.

example

The procedure can now easily be carried out in a table. Here is an example with the two strings “animal” and “gate”.

The distance between the two strings is 2. ε stands for the empty string "".

complexity

The running time of the algorithm is in O , since all cells of the matrix are calculated and the computational effort per cell is constant. ${\ displaystyle (mn)}$ ${\ displaystyle (m + 1) \ times (n + 1)}$

The memory requirement is in because the matrix has rows and columns. If, however, no backtracing is carried out, then only the last row or column is required to calculate the next row or column for the row or column calculation of the matrix. The memory requirement in this case is in . ${\ displaystyle O (nm)}$ ${\ displaystyle m + 1}$ ${\ displaystyle n + 1}$ ${\ displaystyle O (\ min (m, n))}$

The Hirschberg algorithm enables the calculation of the Levenshtein distance and a "backtrace" in quadratic time and with linear memory consumption.

Barriers

There are some upper and lower bounds for the Levenshtein distance :

it is at least the difference between the lengths of the two strings
it is at most the length of the longer character string
it is 0 if and only if both strings are identical (definition of metric)
it is not greater than the Hamming distance plus the difference in length between the character strings

Demarcation

The Levenshtein distance can be viewed as an extension of the Hamming distance , which is limited to replacements and can therefore only measure strings of the same length. A phonetic search can use the Levenshtein distance to allow errors. Words to be compared can be converted into a sound symbol string before a distance calculation, for example using the Cologne phonetics or the Soundex algorithm.

The DP algorithm for calculating the Levenshtein distance is related to the Needleman-Wunsch algorithm for calculating the sequence alignment of two character strings. The Needleman Wunsch algorithm is in and allows any cost functions for inserting and deleting operations of length . In the Levenshtein algorithm, an insert or delete operation consists of a maximum of one character. Variants of the Needleman-Wunsch algorithm limit the choice of the cost function so that its running time is in . ${\ displaystyle O (n ^ {3})}$ ${\ displaystyle \ geq 1}$ ${\ displaystyle O (n ^ {2})}$

The Smith-Waterman algorithm optimizes the costs of the edit operations according to the same DP scheme as the Needleman Wunsch or Levenshtein algorithm, but insert and delete operations are included at the beginning and end of a sequence evaluated and ignored in the backtrace. In other words, the Smith-Waterman algorithm calculates the local "sequence alignment". Its duration is in . ${\ displaystyle 0}$ ${\ displaystyle O (n ^ {2})}$

The Levenshtein distance can be viewed as a special form of the dynamic time warping distance (DTW).

variants

Weighted Levenshtein distance

The costs of the individual insert, delete and replace operations can also be weighted differently or even depending on the characters involved. The method generalized in this way is referred to as the weighted Levenshtein distance, weighted Levenshtein distance or the WLD algorithm for short.

An example of weighted costs for distance operations that can be minimized with the WLD algorithm is the cost of typewriter distance .

Damerau-Levenshtein distance

Damerau-Levenshtein extends the functionality of Levenshtein by the ability to identify two exchanged characters, for example "Raisch" ↔ "Rasich". In order to convert one character string into the other, two operations would be necessary with Levenshtein. Damerau-Levenshtein only needs one.

The following matrix recurrences specify the algorithm variant.

{\ displaystyle m = | u |}

{\ displaystyle n = | v |}

{\ displaystyle D_ {0,0} = 0}

{\ displaystyle D_ {i, 0} = i}

,

{\ displaystyle 1 \ leq i \ leq m}

{\ displaystyle D_ {0, j} = j}

,

{\ displaystyle 1 \ leq j \ leq n}

{\ displaystyle D_ {i, j} = \ min {\ begin {cases} D_ {i-1, j-1} & + 0 \ {\ rm {if}} \ u_ {i} = v_ {j} \ \ D_ {i-1, j-1} & + 1 \ {\ rm {(replacement)}} \\ D_ {i, j-1} & + 1 \ {\ rm {(Einf {\ ddot {u}) } gung)}} \\ D_ {i-1, j} & + 1 \ {\ rm {(L {\ ddot {o}} schung)}} \ end {cases}}}

{\ displaystyle (i = 1.1 \ leq j \ leq n) \ vee (1 \ leq i \ leq m, j = 1)}

{\ displaystyle D_ {i, j} = \ min {\ begin {cases} D_ {i-1, j-1} & + 0 \ {\ rm {if}} \ u_ {i} = v_ {j} \ \ D_ {i-1, j-1} & + 1 \ {\ rm {(replacement)}} \\ D_ {i, j-1} & + 1 \ {\ rm {(Einf {\ ddot {u}) } gung)}} \\ D_ {i-1, j} & + 1 \ {\ rm {(L {\ ddot {o}} schung)}} \\ D_ {i-2, j-2} & + c \ {\ rm {Swap if}} \ u_ {i} = v_ {j-1} \ wedge u_ {i-1} = v_ {j} \\\ end {cases}}}

{\ displaystyle 2 \ leq i \ leq m, 2 \ leq j \ leq n}

Where denotes the cost of swapping two characters. ${\ displaystyle c}$

literature

Fred J. Damerau: A technique for computer detection and correction of spelling errors . In: Communications of the ACM . tape 7 , no. 3 , March 1964, p. 171-176 .
Vladimir I. Levenshtein: Binary codes capable of correcting deletions, insertions, and reversals . In: Doklady Akademii Nauk SSSR . tape 163 , no. 4 , 1965, pp. 845-848 (Russian, English translation in: Soviet Physics Doklady, 10 (8) pp. 707-710, 1966).

Web links

Explanation and online visualization with Flash (English)
Explanation and online visualization with JavaScript (English)
Java applet for calculating the Levenshtein distance, and Hamming distance (English)
Algorithm Levenshtein distance (Wikibooks) (English)