Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt algorithm was named after Donald Ervin Knuth , James Hiram Morris and Vaughan Ronald Pratt and is a string matching algorithm . Its asymptotic running time is linear in the length of the pattern (also search term , search mask ), which is searched for, plus the length of the searched text.
description
The Morris-Pratt algorithm is based on the naive search algorithm . The main difference is that the comparison window is not always moved forward by just one position, but possibly by more than one position.
To do this, the search mask must initially be analyzed for character strings that are as long as possible a prefix of the pattern itself. The algorithm then creates a table that contains the length of the real edge of the prefix for each prefix of length j , i.e. the maximum number of characters within the prefix of the search pattern that are both suffix and prefix of the same. It is also defined that the length of the real edge for a prefix of length zero is −1. This will be useful later in the algorithm when searching.
The following example illustrates what has been said:
length | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | |
Template: | a | b | a | b | c | a | b | a | b | 9 |
Prefix (0..0): | 0 | |||||||||
Prefix (0..1): | 0 | |||||||||
Prefix (0..2): | a | a | 1 | |||||||
Prefix (0..3): | a | b | a | b | 2 | |||||
Prefix (0..4): | 0 | |||||||||
Prefix (0..5): | a | a | 1 | |||||||
Prefix (0..6): | a | b | a | b | 2 | |||||
Prefix (0..7): | a | b | a | a | b | a | 3 | |||
Prefix (0..8): | a | b | a | b | a | b | a | b | 4th |
During the search, the procedure is the same as with the naive search: The characters in the text and search mask are compared at position 0 until two characters no longer match or the search mask is found. If the search mask was found, the algorithm is complete. If the characters do not match before a complete hit, the search mask is shifted to the right by the difference between the number of matching characters and the length of the edge of the prefix. This is where the previous definition of the length of the edge of a prefix of length zero comes into play: the difference between 0 and −1 is 1, so if the character does not match, the first position is shifted by one to the right. In addition, the search for the length of the edge of the prefix or zero, if the edge is shorter than zero, begins further to the right.
With each partial match, for example the first k symbols, it is now known whether the beginning of the search mask matches the end of the last matching partial mask. The search mask is moved after the overlapping match; An additional advantage is that the symbols that have already been compared do not have to be compared again.
The information obtained in this way is used during the search in order to avoid repeated comparisons.
Consequently, the algorithm is divided into two phases, namely
- the prefix parsing, which parses the pattern itself, and
- the actual search.
search
As an example we are looking for the pattern “ababcabab” in the text “abababcbababcababcab ...”.
As with the Naive Algorithm, the pattern is written left-justified below the text and the letter pairs are compared from left to right until the pattern and text no longer match (an error occurs ):
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
The first error is found at position 4 in the text. If we look at the previously calculated table with the length of the edges of a prefix at the position "Prefix (0..3)", we see that the length 2 is stored here. The pattern is shifted 2 characters to the right so that it overlaps appropriately with the suffix (ie the "second part" of the border); in addition, the search continues directly for the edge, since we already know that the two parts match (this is the great strength of the KMP algorithm):
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
The algorithm continues with the comparison at position 4, i.e. exactly where an error was previously found. In particular, the "from" margin is not checked again.
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
The next error occurs at position 7 in the text, and thus at position 5 in the search pattern. We look again at our table at "Prefix (0..4)", it says that there are no characters here that form a border (length zero). So we can now be sure that there are no hits. We searched the characters up to position 7 by naively sliding one character to the right. Therefore, the pattern can be moved to under 7 in the text, the result of search text position + (number of matching characters - margin length (prefix length)), i.e. 2 + (5 - 0) = 7:
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
The algorithm then continues with the comparison again at position 7.
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
Sometimes, as here, an error occurs in the first position of the pattern. In this case the pattern is shifted 1 to the right; Search text position + (number of matching characters - margin length (prefix length)), i.e. 7 + (0 - (−1)) = 8, and the algorithm continues with comparisons at the next position in the text (i.e. 8).
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b |
If all characters of the pattern were found in the text, a hit is output.
Since the last checked four characters "abab" at positions 13 to 16 are a prefix of length 4, the pattern is moved to position 13. The comparison is continued again at position 17 (= 13 + 4):
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | ... |
The algorithm ends as soon as the pattern cannot be shifted further to the right without protruding beyond the end of the text, or as soon as the end of the text is reached.
Observations
- The text is run through exactly once.
- The algorithm either moves further to the right in the text, or it stops in the text and shifts the pattern.
- If the pattern is to be shifted and there is a prefix of length k in front of the character just checked , the pattern is shifted so far that the first k characters are not checked again.
Prefix analysis
In order to find all longest prefixes in the pattern, the pattern is analyzed before the actual search.
To do this, write −1 under the first character in the pattern, and under each additional character the number of characters immediately preceding that form a prefix of the pattern without starting at the beginning of the pattern.
As an example, we will process the pattern "ababcabab" again:
template | value | comment |
---|---|---|
a | −1 | Special case at the beginning of the pattern. |
b | 0 | Without starting at the beginning of the pattern, there are no preceding characters. |
a | 0 | The preceding characters without starting at the beginning of the pattern are “b”. This is not a prefix of the pattern. |
b | 1 | The preceding characters without starting at the beginning of the pattern are “ba”. Only the "a" is also the prefix of the pattern. |
c | 2 | The preceding characters without starting at the beginning of the pattern are "bab". Only the "ab" is also the prefix of the pattern. |
a | 0 | The preceding characters without starting at the beginning of the pattern are “babc”. The prefix “ab” is included, but because of the “c” it is not directly in front of this “a”. |
b | 1 | The preceding characters without starting at the beginning of the pattern are “babca”. Only the "a" is also the prefix of the pattern. |
a | 2 | The preceding characters without starting at the beginning of the pattern are "babcab". Only the "ab" is also the prefix of the pattern. |
b | 3 | The preceding characters without starting at the beginning of the pattern are “babcaba”. Only the "aba" is also the prefix of the pattern. |
4th | The last characters of the pattern without starting at the beginning are “babcabab”. Only the "abab" is also the prefix of the pattern. |
This results in the following prefix table for the “ababcabab” pattern . Note that the last line of the table is one field longer than the pattern.
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 |
---|---|---|---|---|---|---|---|---|---|---|
Template: | a | b | a | b | c | a | b | a | b | |
Value: | −1 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4th |
For comparison, the table above:
length | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Position: | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | |
Template: | a | b | a | b | c | a | b | a | b | 9 |
Prefix (0..0): | 0 | |||||||||
Prefix (0..1): | 0 | |||||||||
Prefix (0..2): | a | a | 1 | |||||||
Prefix (0..3): | a | b | a | b | 2 | |||||
Prefix (0..4): | 0 | |||||||||
Prefix (0..5): | a | a | 1 | |||||||
Prefix (0..6): | a | b | a | b | 2 | |||||
Prefix (0..7): | a | b | a | a | b | a | 3 | |||
Prefix (0..8): | a | b | a | b | a | b | a | b | 4th |
implementation
The following is the algorithm described in pseudocode .
Be input
- a text t of length m , and
- a pattern w of length n .
For every occurrence of the pattern w in the text t , the starting position of the word in the text should be output.
We regard patterns and text as an array , which are numbered starting with zero. So z. B. w [0] the first character of the pattern, and t [8] the ninth character of the text. It is common practice to start numbering with 0 .
Prefix analysis
First, the prefix analysis is carried out. It creates the prefix table discussed above , here only the last line as an array N , which contains the length of the directly preceding prefix for each character of the pattern.
Input is
- a pattern w of length n .
Issue is
- the array N of length n + 1 .
in pseudocode:
i := 0 // Variable i zeigt immer auf die aktuelle Position
j := -1 // im Muster, Variable j gibt die Länge des gefun-
// denen Präfixes an.
N[i] := j // Der erste Wert ist immer -1
while i < n // solange das Ende des Musters nicht erreicht ist
|
| while j >= 0 and w[j] != w[i] // Falls sich ein gefundenes
| | // Präfix nicht verlängern lässt,
| | j := N[j] // nach einem kürzeren suchen.
| |
| end
|
| // an dieser Stelle ist j=-1 oder w[i]=w[j]
|
| i := i+1 // Unter dem nächsten Zeichen im Muster
| j := j+1 // den gefundenen Wert (mindestens 0)
| N[i] := j // in die Präfix-Tabelle eintragen.
|
end
The same code in Python:
def prefix(muster, laenge):
# Laenge des gefundenen Prefix
laengePrefix = -1
# Der erste Wert ist immer -1
prefixTabelle = [laengePrefix]
# solange das Ende des Musters nicht erreicht ist
for positionImMuster in range(0, laenge):
# solange sich ein gefundenes Prefix nicht verlaengern laesst, nach einem kuerzeren suchen
while laengePrefix >= 0 and muster[laengePrefix] is not muster[positionImMuster]:
laengePrefix = prefixTabelle[laengePrefix]
# an dieser Stelle ist laengePrefix == -1 oder muster[positionImMuster] == muster[leangePrefix]
laengePrefix = laengePrefix + 1
# den gefundenen Wert an die prefixTabelle anhaengen
prefixTabelle.append(laengePrefix)
return prefixTabelle
search
The second phase is the search. Since the pattern is of course not actually written under the text and then moved, two variables i and j are used. I indicates the current position in the text and j the current position in the pattern. The meaning is that position j of the pattern is always “under” position i of the text.
i : | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th | 16 | 17th | 18th | 19th | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Text: | a | b | a | b | a | b | c | b | a | b | a | b | c | a | b | a | b | c | a | b | ... |
Template: | a | b | a | b | c | a | b | a | b | ||||||||||||
j : | 0 | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th |
Are input
- the pattern w of length n ,
- the array N of length n + 1 from the first phase, and
- a text t of length m .
Output are
- all positions where w occurs in t .
i := 0 // Variable i zeigt immer auf die aktuelle Position im Text.
j := 0 // Variable j auf die aktuelle Position im Muster.
while i < m // also Textende nicht erreicht
|
| while j >= 0 and t[i] != w[j] // Muster verschieben, bis
| | // Text und Muster an Stelle
| | j := N[j] // i,j übereinstimmen. Dabei
| | // Array N benutzen.
| end
|
| i := i + 1 // In Text und Muster je eine
| j := j + 1 // Stelle weitergehen.
|
| if j == n then /// Ist das Ende des Musters erreicht
| | // einen Treffer melden. Dieser begann
| | print i - n // bereits n Zeichen früher.
| |
| | j := N[j] // Muster verschieben.
| |
| end if
|
end
The same code in Python:
def suche(muster, prefixTabelle, text):
positionImMuster = 0
musterLaenge = len(muster)
# solange das Ende des Textes nicht erreicht ist
for positionImText in range(0, len(text)):
# Muster verschieben bis Text und Muster an Stelle (positionImText, positionImMuster) uebereinstimmen
while positionImMuster >= 0 and text[positionImText] is not muster[positionImMuster]:
# Dazu wird die Prefix-Tabelle verwendet
positionImMuster = prefixTabelle[positionImMuster]
positionImMuster = positionImMuster + 1
# falls das Ende des Musters erreicht ist
if positionImMuster is musterLaenge:
# einen Treffer melden. Der Treffer beginnt bereits musterLaenge Zeichen frueher.
print(positionImText - musterLaenge)
# Muster verschieben
positionImMuster = prefixTabelle[positionImMuster]
Runtime analysis
As usual, the running time is given in Landau notation .
Runtime of the prefix analysis
The outer while loop is run through at most n times, since i = 0 at the beginning and i is increased by 1 with each step.
In the inner while loop, j is set to a previously calculated, smaller value of j , stored in N [j] , on each run . In total, this loop is run through at most as often as j was increased. Since j is only increased in the outer loop, the inner loop is iterated at most n times in total .
However, the entire pattern must also be run through. Therefore the runtime of the prefix analysis is in .
Search duration
The outer while loop is run through at most m times, since i = 0 at the beginning and i is increased by 1 with each step.
Analogous to the phase of the prefix analysis, the inner while loop is run through a maximum of m times.
Since the entire text is run through here too, the runtime is in .
Together
Since prefix analysis and search are performed one after the other, the runtime of the entire algorithm is in . Overall, at most, comparisons are made between characters in the pattern and in the text.
Thus, the algorithm by Knuth, Morris and Pratt may be a better worst-case running time guarantee than the algorithm of Boyer and Moore with .
However, Boyer-Moore can carry out a search under certain circumstances in , Knuth-Morris-Pratt always requires many linear comparisons.
See also
literature
- DE Knuth, JH Morris, VR Pratt (1977): Fast Pattern Matching in Strings . In: SIAM Journal of Computing. 6 (2): 323-350. doi: 10.1137 / 0206024
- Yu. V. Matiyasevich : Real-time recognition of the inclusion relation . In: Journal of Soviet Mathematics . tape 1 , no. 1 , 1973, p. 64-70 , doi : 10.1007 / BF01117471 .