Rabin-Karp algorithm

The Rabin-Karp algorithm is a string search algorithm developed by Michael O. Rabin and Richard M. Karp. The algorithm looks for a pattern, e.g. a character string, within another character string with the help of hash values. The algorithm is not widely used for single-pattern search, but it is of considerable theoretical importance and very effective for searching for multiple patterns in a text. For a text of length n and a pattern of length m, its average and best-case running time is O(n); its (very atypical) worst-case running time is O(nm). This is one of the reasons why the algorithm is not used often. Nevertheless, it has the unique advantage of being able to find any number of strings in a single pass, on average in O(n) time.

One of the simplest practical applications of Rabin-Karp is the detection of plagiarism: take as an example a student who is giving a presentation on Moby Dick. A professor could automatically compile a list of all sentences from various sources on the subject of Moby Dick. Rabin-Karp could then quickly search the submitted document and find every occurrence of one of the sentences from the source material. To prevent the system from being thrown off by small changes to the original sources, details such as punctuation can be removed beforehand and thus ignored. Due to the number of strings searched for, single-pattern search algorithms would be impractical in this case.

Other algorithms to search for substrings

The basic problem the algorithm addresses is finding a fixed string of length m, called the pattern, within a string of length n. An example would be finding the word "sun" in the sentence "Hello sunshine in the valley of tears." A very simple algorithm (see the following listing) for this case would simply check the pattern against every possible position:

 1 function NaiveSearch(string s[1..n], string sub[1..m])
 2     for i from 1 to n - m + 1
 3         for j from 1 to m
 4             if s[i+j-1] ≠ sub[j]
 5                 jump to the next iteration of the outer loop
 6         return i
 7     return not found
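
For illustration, the following is a minimal Python sketch of the same naive search; the function name and the 0-based indexing are illustrative choices and not part of the listing above.

 def naive_search(s: str, sub: str) -> int:
     """Return the first index where sub occurs in s, or -1 if it does not."""
     n, m = len(s), len(sub)
     for i in range(n - m + 1):       # every possible starting position
         if s[i:i + m] == sub:        # compare character by character
             return i
     return -1

 # Example: "sun" starts at index 6
 print(naive_search("Hello sunshine in the valley of tears.", "sun"))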

This algorithm works well in many practical cases but performs poorly on certain inputs. A typical example: searching for a pattern consisting of 10,000 "a"s followed by a "b" in a string consisting of 10 million "a"s. In this case the algorithm reaches its worst-case running time of Θ(mn).

The Knuth-Morris-Pratt algorithm reduces this to Θ(n) by using precomputation to compare each letter only once. The Boyer-Moore algorithm does not advance by just one letter, but skips ahead by as many letters as possible without missing a match; this effectively reduces the number of iterations of the outer loop, so that the number of letters examined can be as small as n/m in the best case. The Rabin-Karp algorithm, by contrast, focuses on speeding up lines 3 to 6 of the listing above.

Accelerating the search using hash values

Instead of reducing the number of comparisons through more sophisticated skipping, the Rabin-Karp algorithm accelerates the comparison between the pattern and substrings of the text by using a hash function. It exploits the fact that identical pieces of text have identical hash values. The idea is therefore to compute the hash value of the pattern and then to look for a substring of the text with the same hash value.

There are two points to consider. First, equal hash values do not necessarily come from identical strings. If the hash values match, the strings themselves still have to be compared, which can take a certain amount of time for long strings. Fortunately, a good hash function makes collisions between different meaningful inputs rare, so this does not significantly increase the average search time.

The algorithm would look like this:

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m])
 3     hs := hash(s[1..m])
 4     for i from 1 to n - m + 1
 5         if hs = hsub
 6             if s[i..i+m-1] = sub
 7                 return i
 8         hs := hash(s[i+1..i+m])
 9     return not found

Lines 2, 3 and 6 each require Ω(m) time. Lines 2 and 3 are only executed once, and line 6 is only executed if the hash values match, which usually happens only a few times. Line 5 is executed n − m + 1 times but only takes constant time.

Now comes the second point: line 8. If the hash value of s[i+1..i+m] were simply recomputed from scratch for each substring, this would take Ω(m) time. Since it is necessary in every loop iteration, the algorithm would need Ω(mn) time overall, just like the naive algorithm NaiveSearch listed above.

The optimization relies on the fact that the variable hs already contains the hash value of s[i..i+m-1]. If a suitable hash function (a rolling hash) is chosen, the next hash value can be calculated from it in constant time. In the simplest case, the hash is the sum of the individual character values of the substring. Then:

hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]
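
As an illustration (not part of the original listing), here is a short Python sketch of this additive rolling hash, using ord() for the character values; the helper name is arbitrary.

 def additive_hash(t: str) -> int:
     """The simplest rolling hash: the sum of the character values."""
     return sum(ord(c) for c in t)

 s = "abracadabra"
 m = 3
 hs = additive_hash(s[0:m])                   # hash of the first window, "abr"
 for i in range(1, len(s) - m + 1):
     # slide the window one position: drop s[i-1], add s[i+m-1], in constant time
     hs = hs - ord(s[i - 1]) + ord(s[i + m - 1])
     assert hs == additive_hash(s[i:i + m])   # same value as recomputing from scratch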

In the worst case, or with a bad hash function such as one that returns a constant, line 6 may be executed in every one of the up to n loop iterations. Since each execution takes Ω(m) time, the whole algorithm still has a worst-case running time of Ω(mn).

Hash functions used

The key to Rabin-Karp's performance is the efficient calculation of the hash values of the successive substrings of the text. A popular and effective rolling hash function treats each substring as a number in some base, usually a large prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 × 101¹ + 105 × 101⁰ = 10609 (the ASCII code of "h" is 104 and of "i" is 105).

For example, if the text is "abracadabra" and we are looking for a pattern of length 3, we can calculate the hash value of "bra" from the hash value of "abr" (the previous substring) by subtracting the number that was added for the first "a" of "abr", i.e. 97 × 101² (97 is the ASCII code of "a" and 101 is the base we use), multiplying the result by the base, and then adding the value for the last "a" of "bra", i.e. 97 × 101⁰ = 97. If the substrings are long, this scheme is very profitable compared to many other hash schemes.
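
The following Python sketch illustrates this base-101 scheme and the constant-time roll from "abr" to "bra"; the function name poly_hash is an illustrative choice, not part of the article.

 BASE = 101   # the base from the example above

 def poly_hash(t: str) -> int:
     """Treat the string as a number written in base BASE, highest power first."""
     h = 0
     for c in t:
         h = h * BASE + ord(c)
     return h

 print(poly_hash("hi"))   # 104 * 101 + 105 = 10609, as in the example above
 h_abr = poly_hash("abr")
 # Roll from "abr" to "bra": subtract the leading 'a', shift by one base position,
 # then add the value of the trailing 'a'.
 h_bra = (h_abr - ord("a") * BASE ** 2) * BASE + ord("a")
 assert h_bra == poly_hash("bra")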

In theory, other schemes allow similar updates, for example multiplying the ASCII values of all letters, so that shifting the window would only involve dividing by the value of the first letter and multiplying by the value of the new last letter. The limitation is the maximum size of an integer, which makes it necessary to use modular arithmetic to keep the hash values small.
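
Putting the pieces together, the following is a minimal Python sketch of the single-pattern search from the listing above, using the base-101 polynomial hash reduced modulo a large prime to keep the values small; the particular modulus and the helper names are illustrative choices, not prescribed by the article.

 BASE = 101          # base as in the examples above
 MOD = 2**61 - 1     # a large prime modulus, used only to keep the hash values small

 def rabin_karp(s: str, sub: str) -> int:
     """Return the first index of sub in s, or -1 if it does not occur."""
     n, m = len(s), len(sub)
     if m == 0 or m > n:
         return -1
     hsub = hs = 0
     for j in range(m):                        # hash the pattern and the first window
         hsub = (hsub * BASE + ord(sub[j])) % MOD
         hs = (hs * BASE + ord(s[j])) % MOD
     high = pow(BASE, m - 1, MOD)              # weight of the window's leading character
     for i in range(n - m + 1):
         if hs == hsub and s[i:i + m] == sub:  # verify to rule out hash collisions
             return i
         if i < n - m:
             # roll the window in constant time: remove s[i], append s[i + m]
             hs = ((hs - ord(s[i]) * high) * BASE + ord(s[i + m])) % MOD
     return -1

 print(rabin_karp("abracadabra", "bra"))       # prints 1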

Rabin-Karp and multiple pattern searches

Rabin-Karp is inferior to Knuth-Morris-Pratt, Boyer-Moore and other faster single-pattern search algorithms because of its slow worst-case behavior. Still, Rabin-Karp is a good algorithm for searching for multiple patterns.

If we want to find every occurrence of any of a large number k of fixed-length patterns in a text, we can create a simple variant of Rabin-Karp that uses a hash table, or any other suitable set data structure, to check whether the hash value of a given substring of the text is among the hash values of the patterns we are looking for.

 function RabinKarpSet(string s[1..n], set of string subs, m) {
     set hsubs := emptySet
     for each sub in subs
         insert hash(sub[1..m]) into hsubs
     hs := hash(s[1..m])
     for i from 1 to n - m + 1
         if hs ∈ hsubs
             if s[i..i+m-1] = a substring with hash hs
                 return i
         hs := hash(s[i+1..i+m])
     return not found
 }

We assume here that all patterns have the same fixed length m, but this assumption can be eliminated: we compare the current hash value against the hash values of all patterns at once with a quick lookup in our data structure, and then verify every hit we find against all patterns with that hash value.

Other algorithms can search for a single pattern in O(n), so they will search for k patterns in O(nk). The Rabin-Karp variant given above still runs in O(n) in the best and average case, since a hash table allows checking in O(1) whether a substring hash equals any of the pattern hashes. The worst case can be bounded by O(mn log k), where m is the length of the longest of the k patterns, by storing the hash values in an AVL tree instead of a hash table.
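
A minimal Python sketch of this multi-pattern variant, under the same fixed-length assumption and reusing the illustrative base-101 modular hash from above:

 BASE = 101
 MOD = 2**61 - 1

 def poly_hash(t: str) -> int:
     h = 0
     for c in t:
         h = (h * BASE + ord(c)) % MOD
     return h

 def rabin_karp_set(s: str, subs: set, m: int) -> int:
     """Return the first index where any length-m pattern from subs occurs in s."""
     n = len(s)
     if m == 0 or m > n:
         return -1
     hsubs = {poly_hash(sub) for sub in subs}  # hash values of all patterns
     hs = poly_hash(s[:m])                     # hash of the first window
     high = pow(BASE, m - 1, MOD)
     for i in range(n - m + 1):
         # on a hash hit, verify against the actual patterns to rule out collisions
         if hs in hsubs and s[i:i + m] in subs:
             return i
         if i < n - m:
             hs = ((hs - ord(s[i]) * high) * BASE + ord(s[i + m])) % MOD
     return -1

 print(rabin_karp_set("abracadabra", {"cad", "bra"}, 3))   # prints 1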
