String matching algorithm

In computer science, string matching algorithms are a group of algorithms that describe how to find text segments in a character string (English: string) based on a predetermined search pattern. They therefore belong to the class of string algorithms.

In a narrower sense, these algorithms search for exact matches. In a broader sense, the term also covers algorithms that allow approximate matches, where the meaning of "approximate" has to be made precise by a tolerance criterion.

The problem is to solve this task as efficiently as possible. In practice, this is important if search terms are to be found in large amounts of text (such as Wikipedia).

Exact search

Problem

There are two basic situations:

  1. An arbitrary text is to be searched for a given search mask.
  2. The text is given in advance, and then arbitrary search masks are to be found in it.

The second case roughly corresponds to the task of preparing Wikipedia in such a way that any search mask can be found quickly and efficiently. Internet search engines are also in this second situation.

However, only the first situation is discussed below.

Solution methods

Naive algorithm

The simplest algorithm consists of sliding a so-called search window of the length of the search mask over the text. At each position, the symbols of the mask are compared with the symbols of the text beneath it. If a non-matching symbol is found, the window is moved forward by one position and the comparison starts again; if all symbols in the window match, the search mask has been found at this position. The algorithm ends when the whole text has been searched.

This algorithm has a running time of the order O(n · m), where m is the length of the search mask and n is the length of the text.

Pseudocode:

Input:  strings T = T1 … Tn and P = P1 … Pm
Output: the positions q at which P occurs in T
  for q = 0 to n − m do
    if P[1] = T[q+1] and P[2] = T[q+2] and … and P[m] = T[q+m] then
      write q
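
The pseudocode translates directly into the following minimal Python sketch (0-based indexing; the function name naive_search is chosen here for illustration):

def naive_search(pattern, text):
    '''Slides a window of length m over the text and compares it with the
    search mask symbol by symbol, as in the pseudocode above.'''
    n, m = len(text), len(pattern)
    matches = []
    for q in range(n - m + 1):  # every possible window position
        if all(pattern[i] == text[q + i] for i in range(m)):
            matches.append(q)
    return matches

print(naive_search("ab", "aaab"))  # -> [2]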

Surprisingly, the naive approach is very fast in practice, because mismatches in natural-language texts usually occur after only 1 to 2 characters. For English, the expected value is about 1.07 compared characters per position. The naive approach is therefore almost linear in practice.

This also becomes clear when one looks at the worst case. It reads

Text:    aaa...aab
Pattern: ab

Such cases are extremely unlikely in natural language texts.

Finite automaton

Figure: a finite automaton that finds the word anax

In the string matching algorithm based on finite automata, an automaton with the state set Q = {0, 1, …, m} is created for an alphabet Σ and a given search pattern P of length m. A state indicates how many characters at the current position match the beginning of the search pattern. For brevity, let P_k denote the prefix of the search pattern up to and including the character at position k. The transition function δ, given a state q and a character x, returns the state k that represents the maximum number of characters for which P_k is a suffix of the word P_q x, that is, δ(q, x) = max{ k : P_k is a suffix of P_q x }. The search pattern has been found exactly when the final state is reached, that is, when the automaton is in state m.

The advantage of this algorithm over the naive algorithm is that it does not throw away the knowledge already gained about the processed part of the text when a non-matching character is found. Suppose we are looking for the pattern anax in the text ananax. When the automaton-based algorithm encounters the second n of ananax (the fourth character), it keeps the information that the last two characters read already form the prefix an of the pattern and continues the search from there, starting at the second an. The naive algorithm, on the other hand, discards the entire part already processed and starts its next attempt only one position further on, at the first n of ananax.

Python implementation

def is_suffix(suffix, word):
    '''Checks whether suffix is a suffix of word.'''

    return word.endswith(suffix)

def transition(pattern, state, event):
    '''Computes the transition function of the automaton.'''

    for k in range(state + 1, 0, -1):
        if is_suffix(pattern[:k], pattern[:state] + event):
            return k
    return 0

def create_matcher(alphabet, pattern):
    '''Creates all states and a transition-function table.'''

    transition_table = {}
    for state in range(0, len(pattern) + 1):
        for event in alphabet:
            transition_table[(state, event)] = \
                transition(pattern, state, event)
    return transition_table, len(pattern)

def match(matcher, text):
    '''Returns the matches found in the text, using the automaton
    created by create_matcher.'''

    transition_table, last_state = matcher
    matches = []
    state = 0
    for text_pos in range(0, len(text)):
        state = transition_table[(state, text[text_pos])]
        if state == last_state:
            matches.append(text_pos - last_state + 1)
    return matches

def find(alphabet, pattern, text):
    matcher = create_matcher(alphabet, pattern)
    return match(matcher, text)
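
For illustration, the implementation above could be used as follows. Note that the alphabet passed to create_matcher has to contain every character that occurs in the text, because the transition table is only built for those characters:

print(find("anx", "anax", "ananax"))  # -> [2], the pattern starts at index 2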

The Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt algorithm is based on the naive search algorithm. The main difference is that the comparison window is not always moved forward by just one position, but possibly by more than one position.

For this purpose, the search mask is analyzed in a preprocessing step, so that for every partial match, say of the first k symbols, it is known how far the beginning of the search mask overlaps with the end of the matched part. The search mask is then shifted according to this overlap; an additional advantage is that symbols that have already been compared do not have to be compared again.
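
The following is a compact Python sketch of this idea (the helper name compute_prefix_table is chosen here for illustration; it records, for every number of matched symbols, how far the beginning of the pattern overlaps with the end of the matched part):

def compute_prefix_table(pattern):
    '''prefix[q] is the length of the longest proper prefix of the pattern
    that is also a suffix of the first q matched symbols.'''
    prefix = [0] * (len(pattern) + 1)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = prefix[k]
        if pattern[k] == pattern[q]:
            k += 1
        prefix[q + 1] = k
    return prefix

def kmp_search(pattern, text):
    '''Returns all start positions of the pattern in the text.'''
    prefix = compute_prefix_table(pattern)
    matches = []
    q = 0                                  # number of symbols matched so far
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:
            q = prefix[q]                  # shift the pattern, keep the overlap
        if pattern[q] == c:
            q += 1
        if q == len(pattern):              # complete match
            matches.append(i - len(pattern) + 1)
            q = prefix[q]                  # continue, allowing overlapping matches
    return matches

print(kmp_search("anax", "ananax"))  # -> [2]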

Search in the suffix tree

Building a suffix tree is particularly worthwhile when the text to be searched is known in advance and many different patterns are to be searched for in it later. The construction can be done in O(n). Each pattern can then be searched for in O(m) without the text having to be prepared again: if the pattern occurs, the corresponding node can be reached from the root of the suffix tree; otherwise the search fails (there is no corresponding node).
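
As a simplified illustration of the lookup idea, here is a Python sketch using an uncompressed suffix trie that is built naively in O(n²); a real suffix tree compresses chains of single-child nodes and can be built in O(n), for example with Ukkonen's algorithm:

def build_suffix_trie(text):
    '''Inserts every suffix of the text into a trie of nested dictionaries
    (naive O(n^2) construction, for illustration only).'''
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def occurs(trie, pattern):
    '''Follows the pattern from the root; the search succeeds exactly
    if a corresponding node is reached.'''
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ananax")
print(occurs(trie, "anax"), occurs(trie, "axa"))  # -> True False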

Overview

Algorithm                      Preparation time    Search time
Naive algorithm                (none)              O((n − m + 1) · m)
Rabin-Karp algorithm           O(m)                average O(n + m), worst O((n − m + 1) · m)
Finite automaton               O(m · |Σ|)          O(n)
Knuth-Morris-Pratt algorithm   O(m)                O(n)
Boyer-Moore algorithm          O(m + |Σ|)          average O(n/m), worst O(n · m)
Shift-Or algorithm             O(m + |Σ|)          O(n)
Search in the suffix tree      O(n)                O(m)

Here m is the length of the search mask, n the length of the text, and Σ the alphabet.

More algorithms

  • Skip search algorithm
  • Baeza-Yates-Gonnet algorithm (Shift-Or or Shift-And)
  • BNDM (Backward Nondeterministic Dawg Matching)
  • BOM (Backward Oracle Matching)

Multi-string matching

Searching for multiple patterns in a text is called multi-string matching. Most algorithms are derived from a corresponding string matching algorithm for exactly one pattern. Treating word overlaps is a particular challenge when searching for multiple search terms.

List of algorithms

Multi-string algorithm    Corresponding single-string algorithm
Multi-Shift-And           Shift-And
Aho-Corasick              Knuth-Morris-Pratt
Commentz-Walter           Boyer-Moore
Set-Horspool              Horspool
Wu-Manber                 Horspool / Rabin-Karp
Set-BOM                   BOM
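
As an illustration of how such a derivation can look, the following Python sketch outlines the Aho-Corasick idea: a trie of all patterns is extended with failure links that play the same role as the Knuth-Morris-Pratt prefix table (the function names are chosen here for illustration):

from collections import deque

def build_aho_corasick(patterns):
    '''Builds a trie of all patterns plus failure links and output lists.'''
    goto = [{}]      # outgoing trie edges per state
    fail = [0]       # failure link per state
    output = [[]]    # patterns that end in each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pat)
    queue = deque(goto[0].values())          # breadth-first over the trie
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]
    return goto, fail, output

def search_all(automaton, text):
    '''Reports (position, pattern) for every occurrence of any pattern.'''
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

automaton = build_aho_corasick(["he", "she", "his", "hers"])
print(search_all(automaton, "ushers"))
# -> [(1, 'she'), (2, 'he'), (2, 'hers')]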

Pattern matching search

Pattern search lies between fuzzy and exact search, since the user must explicitly specify what leeway is allowed, in the form of character classes at certain string positions.
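
For illustration, such position-wise character classes can be expressed, for example, with Python's re module (the pattern gr[ae]y used here is only an example):

import re

# The character class [ae] allows either 'a' or 'e' at that position,
# while every other position has to match exactly.
print(re.findall(r"gr[ae]y", "gray hair, grey sky, green grass"))
# -> ['gray', 'grey']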

Fuzzy search

In fuzzy search, a quality or distance criterion usually decides how far a hit may deviate from the search pattern.
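
A common distance criterion is the Levenshtein (edit) distance. The following Python sketch reports every text position whose window lies within a given maximum distance of the pattern; it is a deliberately simple O(n · m²) illustration with a fixed window length, not an optimized approximate-matching algorithm:

def levenshtein(a, b):
    '''Classic dynamic-programming edit distance (insert, delete, substitute).'''
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_find(pattern, text, max_distance):
    '''Returns the window positions whose distance to the pattern is small enough.'''
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if levenshtein(pattern, text[i:i + m]) <= max_distance]

print(fuzzy_find("anax", "ananas", 1))  # -> [0, 2]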

This form of search also includes searching for identically sounding words in a text (phonetic search).
