Knuth-Morris-Pratt algorithm

from Wikipedia, the free encyclopedia

The Knuth-Morris-Pratt algorithm was named after Donald Ervin Knuth , James Hiram Morris and Vaughan Ronald Pratt and is a string matching algorithm . Its asymptotic running time is linear in the length of the pattern (also search term , search mask ), which is searched for, plus the length of the searched text.

description

The Morris-Pratt algorithm is based on the naive search algorithm . The main difference is that the comparison window is not always moved forward by just one position, but possibly by more than one position.

To do this, the search mask must initially be analyzed for character strings that are as long as possible a prefix of the pattern itself. The algorithm then creates a table that contains the length of the real edge of the prefix for each prefix of length j , i.e. the maximum number of characters within the prefix of the search pattern that are both suffix and prefix of the same. It is also defined that the length of the real edge for a prefix of length zero is −1. This will be useful later in the algorithm when searching.

The following example illustrates what has been said:

length
Position: 0 1 2 3 4th 5 6th 7th 8th
Template: a b a b c a b a b 9
Prefix (0..0): 0
Prefix (0..1): 0
Prefix (0..2): a a 1
Prefix (0..3): a b a b 2
Prefix (0..4): 0
Prefix (0..5): a a 1
Prefix (0..6): a b a b 2
Prefix (0..7): a b a a b a 3
Prefix (0..8): a b a b a b a b 4th

During the search, the procedure is the same as with the naive search: The characters in the text and search mask are compared at position 0 until two characters no longer match or the search mask is found. If the search mask was found, the algorithm is complete. If the characters do not match before a complete hit, the search mask is shifted to the right by the difference between the number of matching characters and the length of the edge of the prefix. This is where the previous definition of the length of the edge of a prefix of length zero comes into play: the difference between 0 and −1 is 1, so if the character does not match, the first position is shifted by one to the right. In addition, the search for the length of the edge of the prefix or zero, if the edge is shorter than zero, begins further to the right.

With each partial match, for example the first k symbols, it is now known whether the beginning of the search mask matches the end of the last matching partial mask. The search mask is moved after the overlapping match; An additional advantage is that the symbols that have already been compared do not have to be compared again.

The information obtained in this way is used during the search in order to avoid repeated comparisons.

Consequently, the algorithm is divided into two phases, namely

  1. the prefix parsing, which parses the pattern itself, and
  2. the actual search.

search

As an example we are looking for the pattern “ababcabab” in the text “abababcbababcababcab ...”.

As with the Naive Algorithm, the pattern is written left-justified below the text and the letter pairs are compared from left to right until the pattern and text no longer match (an error occurs ):

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

The first error is found at position 4 in the text. If we look at the previously calculated table with the length of the edges of a prefix at the position "Prefix (0..3)", we see that the length 2 is stored here. The pattern is shifted 2 characters to the right so that it overlaps appropriately with the suffix (ie the "second part" of the border); in addition, the search continues directly for the edge, since we already know that the two parts match (this is the great strength of the KMP algorithm):

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

The algorithm continues with the comparison at position 4, i.e. exactly where an error was previously found. In particular, the "from" margin is not checked again.

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

The next error occurs at position 7 in the text, and thus at position 5 in the search pattern. We look again at our table at "Prefix (0..4)", it says that there are no characters here that form a border (length zero). So we can now be sure that there are no hits. We searched the characters up to position 7 by naively sliding one character to the right. Therefore, the pattern can be moved to under 7 in the text, the result of search text position + (number of matching characters - margin length (prefix length)), i.e. 2 + (5 - 0) = 7:

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

The algorithm then continues with the comparison again at position 7.

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

Sometimes, as here, an error occurs in the first position of the pattern. In this case the pattern is shifted 1 to the right; Search text position + (number of matching characters - margin length (prefix length)), i.e. 7 + (0 - (−1)) = 8, and the algorithm continues with comparisons at the next position in the text (i.e. 8).

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b

If all characters of the pattern were found in the text, a hit is output.

Since the last checked four characters "abab" at positions 13 to 16 are a prefix of length 4, the pattern is moved to position 13. The comparison is continued again at position 17 (= 13 + 4):

Position: 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b ...

The algorithm ends as soon as the pattern cannot be shifted further to the right without protruding beyond the end of the text, or as soon as the end of the text is reached.

Observations

  1. The text is run through exactly once.
  2. The algorithm either moves further to the right in the text, or it stops in the text and shifts the pattern.
  3. If the pattern is to be shifted and there is a prefix of length k in front of the character just checked , the pattern is shifted so far that the first k characters are not checked again.

Prefix analysis

In order to find all longest prefixes in the pattern, the pattern is analyzed before the actual search.

To do this, write −1 under the first character in the pattern, and under each additional character the number of characters immediately preceding that form a prefix of the pattern without starting at the beginning of the pattern.

As an example, we will process the pattern "ababcabab" again:

template value comment
a −1 Special case at the beginning of the pattern.
b 0 Without starting at the beginning of the pattern, there are no preceding characters.
a 0 The preceding characters without starting at the beginning of the pattern are “b”. This is not a prefix of the pattern.
b 1 The preceding characters without starting at the beginning of the pattern are “ba”. Only the "a" is also the prefix of the pattern.
c 2 The preceding characters without starting at the beginning of the pattern are "bab". Only the "ab" is also the prefix of the pattern.
a 0 The preceding characters without starting at the beginning of the pattern are “babc”. The prefix “ab” is included, but because of the “c” it is not directly in front of this “a”.
b 1 The preceding characters without starting at the beginning of the pattern are “babca”. Only the "a" is also the prefix of the pattern.
a 2 The preceding characters without starting at the beginning of the pattern are "babcab". Only the "ab" is also the prefix of the pattern.
b 3 The preceding characters without starting at the beginning of the pattern are “babcaba”. Only the "aba" is also the prefix of the pattern.
4th The last characters of the pattern without starting at the beginning are “babcabab”. Only the "abab" is also the prefix of the pattern.

This results in the following prefix table for the “ababcabab” pattern . Note that the last line of the table is one field longer than the pattern.

Prefix table for "ababcabab"
Position: 0 1 2 3 4th 5 6th 7th 8th 9
Template: a b a b c a b a b
Value: −1 0 0 1 2 0 1 2 3 4th

For comparison, the table above:

length
Position: 0 1 2 3 4th 5 6th 7th 8th
Template: a b a b c a b a b 9
Prefix (0..0): 0
Prefix (0..1): 0
Prefix (0..2): a a 1
Prefix (0..3): a b a b 2
Prefix (0..4): 0
Prefix (0..5): a a 1
Prefix (0..6): a b a b 2
Prefix (0..7): a b a a b a 3
Prefix (0..8): a b a b a b a b 4th

implementation

The following is the algorithm described in pseudocode .

Be input

  • a text t of length m , and
  • a pattern w of length n .

For every occurrence of the pattern w in the text t , the starting position of the word in the text should be output.

We regard patterns and text as an array , which are numbered starting with zero. So z. B. w [0] the first character of the pattern, and t [8] the ninth character of the text. It is common practice to start numbering with 0 .

Prefix analysis

First, the prefix analysis is carried out. It creates the prefix table discussed above , here only the last line as an array N , which contains the length of the directly preceding prefix for each character of the pattern.

Input is

  • a pattern w of length n .

Issue is

  • the array N of length n + 1 .

in pseudocode:

 i := 0       // Variable i zeigt immer auf die aktuelle Position
 j := -1      // im Muster,  Variable j gibt die Länge des gefun-
              // denen Präfixes an.

 N[i] := j       // Der erste Wert ist immer -1

 while i < n     // solange das Ende des Musters nicht erreicht ist
 |
 |   while j >= 0 and w[j] != w[i]   // Falls sich ein gefundenes
 |   |                               // Präfix nicht verlängern lässt,
 |   |   j := N[j]                   // nach einem kürzeren suchen.
 |   |
 |   end
 |
 |           // an dieser Stelle ist j=-1 oder w[i]=w[j]
 |
 |   i := i+1                // Unter dem nächsten Zeichen im Muster
 |   j := j+1                // den gefundenen Wert (mindestens 0)
 |   N[i] := j               // in die Präfix-Tabelle eintragen.
 |
 end

The same code in Python:

def prefix(muster, laenge):
    # Laenge des gefundenen Prefix
	laengePrefix = -1

	# Der erste Wert ist immer -1
	prefixTabelle = [laengePrefix]

	# solange das Ende des Musters nicht erreicht ist
	for positionImMuster in range(0, laenge):
		# solange sich ein gefundenes Prefix nicht verlaengern laesst, nach einem kuerzeren suchen
		while laengePrefix >= 0 and muster[laengePrefix] is not muster[positionImMuster]:
			laengePrefix = prefixTabelle[laengePrefix]

		# an dieser Stelle ist laengePrefix == -1 oder muster[positionImMuster] == muster[leangePrefix]
		laengePrefix = laengePrefix + 1
		# den gefundenen Wert an die prefixTabelle anhaengen
		prefixTabelle.append(laengePrefix)

	return prefixTabelle

search

The second phase is the search. Since the pattern is of course not actually written under the text and then moved, two variables i and j are used. I indicates the current position in the text and j the current position in the pattern. The meaning is that position j of the pattern is always “under” position i of the text.

Example for i = 5 and j = 3 :
i : 0 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th 16 17th 18th 19th ...
Text: a b a b a b c b a b a b c a b a b c a b ...
Template: a b a b c a b a b
j : 0 1 2 3 4th 5 6th 7th 8th

Are input

  • the pattern w of length n ,
  • the array N of length n + 1 from the first phase, and
  • a text t of length m .

Output are

  • all positions where w occurs in t .
 i := 0   // Variable i zeigt immer auf die aktuelle Position im Text.
 j := 0   // Variable j auf die aktuelle Position im Muster.

 while i < m   // also Textende nicht erreicht
 |
 |   while j >= 0 and t[i] != w[j]      // Muster verschieben, bis
 |   |                                  // Text und Muster an Stelle
 |   |   j := N[j]                      // i,j übereinstimmen. Dabei
 |   |                                  // Array N benutzen.
 |   end
 |
 |   i := i + 1              // In Text und Muster je eine
 |   j := j + 1              // Stelle weitergehen.
 |
 |   if j == n then          /// Ist das Ende des Musters erreicht
 |   |                       // einen Treffer melden. Dieser begann
 |   |   print i - n         // bereits n Zeichen früher.
 |   |
 |   |   j := N[j]           // Muster verschieben.
 |   |
 |   end if
 |
 end

The same code in Python:

def suche(muster, prefixTabelle, text):
	positionImMuster = 0
	musterLaenge = len(muster)

	# solange das Ende des Textes nicht erreicht ist
	for positionImText in range(0, len(text)):
		# Muster verschieben bis Text und Muster an Stelle (positionImText, positionImMuster) uebereinstimmen
		while positionImMuster >= 0 and text[positionImText] is not muster[positionImMuster]:
			# Dazu wird die Prefix-Tabelle verwendet
			positionImMuster = prefixTabelle[positionImMuster]

		positionImMuster = positionImMuster + 1

		# falls das Ende des Musters erreicht ist
		if positionImMuster is musterLaenge:
			# einen Treffer melden. Der Treffer beginnt bereits musterLaenge Zeichen frueher.
			print(positionImText - musterLaenge)
			# Muster verschieben
			positionImMuster = prefixTabelle[positionImMuster]

Runtime analysis

As usual, the running time is given in Landau notation .

Runtime of the prefix analysis

The outer while loop is run through at most n times, since i = 0 at the beginning and i is increased by 1 with each step.

In the inner while loop, j is set to a previously calculated, smaller value of j , stored in N [j] , on each run . In total, this loop is run through at most as often as j was increased. Since j is only increased in the outer loop, the inner loop is iterated at most n times in total .

However, the entire pattern must also be run through. Therefore the runtime of the prefix analysis is in .

Search duration

The outer while loop is run through at most m times, since i = 0 at the beginning and i is increased by 1 with each step.

Analogous to the phase of the prefix analysis, the inner while loop is run through a maximum of m times.

Since the entire text is run through here too, the runtime is in .

Together

Since prefix analysis and search are performed one after the other, the runtime of the entire algorithm is in . Overall, at most, comparisons are made between characters in the pattern and in the text.

Thus, the algorithm by Knuth, Morris and Pratt may be a better worst-case running time guarantee than the algorithm of Boyer and Moore with .

However, Boyer-Moore can carry out a search under certain circumstances in , Knuth-Morris-Pratt always requires many linear comparisons.

See also

literature

Web links