Boyer-Moore algorithm

from Wikipedia, the free encyclopedia

The Boyer-Moore algorithm is a string matching algorithm . The algorithm is used to find a certain partial text (pattern M ) in a text T and was developed in 1977 by Robert S. Boyer and J Strother Moore .

algorithm

The pattern is written left-justified under the text at the beginning and then compared with the text from right to left character by character. As soon as a mismatch occurs, two heuristics calculate how far the search pattern can be shifted to the right.

Bad character heuristic
If, when comparing the pattern with the text from right to left, a character in the pattern does not match the character in the text (" bad character "), the pattern is searched for the last occurrence of this bad character and the pattern is shifted so far until both letters are on top of each other. If this bad character does not exist in the pattern, the pattern is shifted so that its first character is below the successor of the bad character . It can happen that the bad character heuristic suggests shifting the pattern to the left. In this case it is shifted one position to the right.
Good suffix heuristic
If, when comparing the pattern with the text from right to left, a suffix of the pattern matches the text and then a mismatch occurs, the pattern is shifted to the right until a partial word of the pattern fits the suffix again. If the suffix does not exist a second time in the pattern, the pattern is shifted to the right by its full length.

It happens that the two heuristics calculate different shifts. The algorithm always chooses the maximum of the two proposals in order to shift the pattern to the right.

In order to make the procedure efficient, a jump table is calculated for each of the two heuristics in a preprocessing step. The jump table for the bad character heuristic contains the distance from the position of the last occurrence in the search pattern to the end of the search pattern for each character occurring in the search pattern. The table for the good suffix heuristic contains the distance from the end of the pattern from which it appears again in the pattern for each partial pattern (viewed from the rear).

Examples

Bad character heuristic

String: Hoola-Hoola girls like Hooligans

Search pattern: hooligan

Hoola-Hoola girls like Hooligans
Hooligan

The last character does not match the last of the search pattern, so you can move the search pattern to the first "o" (of the search pattern, read from the end):

Hoola-Hoola girls like Hooligans
Hooligan
     Hooligan

The last character of the search pattern does not match the text. At the relevant point there is a “g”, which in turn can be found in the pattern in the third position from the right. If the pattern is shifted two characters to the right, the blue "g" "match":

Hoola-Hoola girls like Hooligans
     Hooligan
       Hooligan

The “r” in the string does not appear in the pattern at all. This enables the search string to be shifted by the entire length, since it is impossible here for the pattern to match at one point.

Hoola-Hoola girls like Hooligans
       Hooligan
               Hooligan
                       Hooligan

Pattern was found.

Another example that doesn't affect the last character of the search string. Therefore the entire search string can only be shifted up to one position after the mismatch found.

A. B. R. A. G A. D. A. B. R. A. K A. D. A. B. R. A.
A. B. R. A. K A. D. A. B. R. A.              
          A. B. R. A. K A. D. A. B. R. A.    
              A. B. R. A. K A. D. A. B. R. A.

Skip table for the example above:

A. B. R. K D.
0 2 1 6th 4th

The Boyer-Moore algorithm works most efficiently when it finds a character that does not appear in the search pattern. The bad character rule then applies. This is very likely with a relatively small pattern and large alphabet, which makes it particularly suitable for such a case. In this case the algorithm works with an efficiency of comparisons.

Good suffix heuristic

String: pure super sour super super suupa

Search pattern: supersupe

reinesupersauersupesupersupe
supersupe

Only the last 4 letters match ("supe"). This suffix occurs at the very beginning of the pattern, so you can move the pattern up to there:

reinesupersauersupesupersupe
     supersupe

Only the last letter "e" matches. We can shift the pattern until the next occurrence of "e" in supersupe:

reinesupersauersupesupersupe
          supersupe

Only the last letters "ersupe" match, which no longer appear anywhere else in the pattern. However, the suffix “supe” occurs both at the end of “ersupe” and at the beginning of the pattern.

reinesupersauersupesupersupe
               supersupe

“E” and “r” do not match, we move one position. This case occurs several times in a row:

reinesupersauersupesupersupe
                supersupe
reinesupersauersupesupersupe
                 supersupe
reinesupersauersupesupersupe
                  supersupe
reinesupersauersupesupersupe
                   supersupe

Pattern was found. Together with the "bad character heuristic", the last 3 iterations could be skipped, since we can move to the next "r" in the pattern.

running time

In the following, Landau notation is used to indicate the asymptotic behavior of the runtime. If the Boyer-Moore algorithm only searches for the first occurrence of the pattern, it has a worst-case complexity of . However, if it is used to find all matches in the pattern, the worst-case complexity is . However, this complexity can be reduced back to with an additional rule in the event that the pattern was found in the last step . However, if you use the algorithm for a relatively small pattern and a large alphabet, you get an average case complexity of .

Bad character heuristic

If you only use the bad character heuristic, you still get an average case complexity of , but you have to accept a worst case complexity of . A worst-case example is the text together with the pattern . Here the whole pattern is always compared with the text until a mismatch occurs in the first place. After such a mismatch, the pattern can only be shifted one place to the right (using bad character heuristics).

Sample code in C ++

In practice, the algorithm applies both rules and always uses the rule that allows the pattern to jump the furthest. For the “good suffix rule” and for the “bad sign rule” one creates one at the beginning of the search Jump table (see “charTable” and “offsetTable” in the code).

In the following source code, the “good suffix rule” table (charTable) is created in the (unlikely) worst case in , which can be neglected if the patterns are not too large. The search for the suffix for the “bad sign rule” table can be done, for example, using the KMP algorithm, which is avoided here for the sake of clarity. So the following algorithm is in .

If one can output the number of required comparisons, then these are surprisingly few for a large alphabet and the Boyer-Moore algorithm, for example, is preferable to the Knuth-Morris-Pratt algorithm .

#include <algorithm>
#include <iostream>
#include <limits>
#include <string_view>
#include <vector>

using namespace std;

/**
 * Ist needle[position:end] ein Präfix von needle?
 */
bool isPrefix(string_view needle, int position) {
    for (int i = position, j = 0; i < needle.size(); ++i, ++j)
        if (needle[i] != needle[j])
            return false;

    return true;
}

/**
 * Gibt die maximale Länge der Teilzeichenfolge zurück, die bei position endet
 * und ein Suffix ist. (Guter-Suffix-Regel)
 */
int suffixSize(string_view needle, int position) {
    int size = 0, i = position, j = needle.size() - 1;

    while (i >= 0 and needle[i] == needle[j]) {
        --i;
        --j;
        ++size;
    }

    return size;
}

/**
 * Erstellt die Sprungtabelle basierend auf den nicht übereinstimmenden
 * Zeicheninformationen.
 */
vector<int> makeCharTable(string_view needle) {
    vector<int> table(numeric_limits<char>::max());

    for (int &value : table)
        value = needle.size();

    for (int i = 0; i < needle.size() - 2; ++i)
        table[needle[i]] = needle.size() - 1 - i;

    return table;
}

/**
 * Erstellt die Sprungtabelle basierend auf dem Scanoffset, bei dem eine
 * Nichtübereinstimmung auftritt. (Schlechtes-Zeichen-Regel)
 */
vector<int> makeOffsetTable(string_view needle) {
    vector<int> table(needle.size());
    int lastPrefixPosition = needle.size();

    for (int i = needle.size(); i > 0; --i) {
        if (isPrefix(needle, i))
            lastPrefixPosition = i;

        table[needle.size() - i] = lastPrefixPosition - i + needle.size();
    }

    for (int i = 0; i < needle.size() - 1; ++i) {
        int size = suffixSize(needle, i);
        table[size] = needle.size() - 1 - i + size;
    }

    return table;
}

/**
 * Gibt den Index innerhalb der Zeichenfolge des ersten Auftretens vom
 * spezifizierten Teilstring zurück. Wenn es sich nicht um einen Teilstring
 * handelt, wird -1 zurückgegeben.
 */
int indexOf(string_view haystack, string_view needle) {
    if (needle.size() == 0)
        return 0;

    vector<int> charTable = makeCharTable(needle);
    vector<int> offsetTable = makeOffsetTable(needle);

    for (int i = needle.size() - 1, j; i < haystack.size();) {
        for (j = needle.size() - 1; needle[j] == haystack[i]; --i, --j)
            if (j == 0)
                return i;

        i += max(offsetTable[needle.size() - 1 - j], charTable[haystack[i]]);
    }

    return -1;
}

int main() {
    string_view haystack = "ttabarbsxfarbbarb";
    string_view needle = "arbbarb";
    cout << "found at position " << indexOf(haystack, needle) << endl;

    return 0;
}

literature

Web links