Baeza-Yates-Gonnet algorithm

from Wikipedia, the free encyclopedia

The Baeza-Yates-Gonnet-Algorithm or Shift-or-Algorithm , also known as Shift-and or Bitap , solves the string matching problem by simulating a nondeterministic automaton . Among other things, a modification of this algorithm is used in the Unix tool grep .

Since the implementation can be traced back to bit operations, the algorithm is already very efficient in terms of its execution. If you combine this with the underlying system (one loop over the pattern in preprocessing, one loop over the text during the search), the result is an extremely efficient algorithm.

basis

The algorithm is based on a set of vectors with the following definition:

This clearly means that exactly when, after processing characters in the text, the last characters match the first characters of the search pattern.

A hit for a search pattern with length is found, if .

Furthermore, the characteristic vectors are required for all characters that may appear in the text:

Example:

Search pattern , length

Characteristic vectors:

Sequence (exact matching)

In order to simplify the process, a special bit operation Bitshift or for the vector is introduced first:

The algorithm for exact matches can now be reduced to a few simple steps:

  1. Initialize the vector with 0 (for all positions) and begin with the first character of the text to be searched
  2. Move all bits in one position to the right using a bit shift .
  3. Carry out a combination of and the characteristic vector of the current text character.
  4. Go to the next text character. If end, cancel, otherwise go to (2)

On closer inspection, steps (2) and (3) carry out exactly the calculation rule for : By shifting the "old" character is created in place of the place (corresponds in combination with the condition ). The characteristic vector of the current text character has at the point just then , if patterns and text match here. This links both conditions.

Example (exact matching)

Pattern: ,

Text:

There is a hit at (position - pattern length + correction for first character).

Extension (approximate matching)

The algorithm can carry out a fault-tolerant search with slight modifications. For this, the vector is divided:

  1. : corresponds to the previous one ; The index stands for the number of errors that have occurred.
  2. : Designates a vector that is aimed at hits with at most one error.
  3. ...
  4. : Designates a vector that is aimed at hits with a maximum of errors.

Caution: In the case of the error-prone vectors, the above interpretation with “after j characters the last i match the first i of the pattern” is difficult and no longer necessarily plausible.

The calculation rule for remains unchanged. For error vectors , a distinction is made according to the causing action:

Insert a character in the search pattern

Explanation: The first part of the expression describes when there are already errors, but the current character of the text and pattern match. The second part describes the error case: So far (in ) there were only errors, so the current character can be inserted into the pattern.

Interpretation: if after the input of the last characters at least characters match the search pattern and the rest can be matched by inserting the missing characters.

Deleting a character from the search pattern

Explanation: The first part of the expression describes when there are already errors, but the current character of the text and pattern match. The second part describes the error case: If you do not look at the first characters of the text , but only the first (in the vector the position above), the pattern matches except for errors. The mark of the pattern is then simply deleted.

Replace a character in the pattern

Explanation: The first part of the expression describes when there are already errors, but the current character of the text and pattern match. The second part describes the error case: According to characters, the last characters matched . If you now replace the character in the pattern with the character in the text, the last characters also match the first characters of the "new" pattern after characters .

The variants can be linked by bit or as required.

literature

Web links

  • StringSearch - high-performance pattern matching algorithms in Java (implementations of the Shift-Or algorithm in Java; English)
  • BNDM algorithm (PDF; 289 kB)