Aho-Corasick algorithm

from Wikipedia, the free encyclopedia

The Aho-Corasick algorithm is a string matching algorithm that was developed by Alfred V. Aho and Margaret J. Corasick in 1975 .

The algorithm performs a kind of dictionary comparison in which the input is efficiently searched for previously defined signatures. No character of the input is read more than once. The process is particularly efficient when the signatures overlap, ie are in prefix and suffix relationships (e.g. {"hedgehog", "money", "Eldorado"}). The Aho-Corasick algorithm consists of two phases:

  1. First, the algorithm generates a trie structure based on the given or desired dictionary set , which can also be interpreted as a finite state machine . Each state is assigned an output quantity that is initially empty. For each state in which a search term ends, this search term is added to its output set.
  2. In the second phase, the nodes or states of the tries are numbered disjointly in pairs and an error function is calculated. The function is defined for each node and determines the state in which processing is continued if the character of the input that has just been read does not cause a regularly valid state transition. This makes it possible to quickly switch between already recognized partial signatures and to continue the recognition without having to restart the machine.

example

Visualization of a pattern-recognizing machine

Simply put, the algorithm builds a finite state machine and compares it with the input text. If the signature is known in advance (for example, an anti-virus database), the structure can also before starting the program off-line will be made and stored for later use.

The Aho-Corasick algorithm is the basis of the UNIX command fgrep , the intrusion detection system Snort and the web application firewall ModSecurity .

Web links

swell

  1. ^ Alfred V. Aho, Margaret J. Corasick: Efficient string matching: An aid to bibliographic search . In: Communications of the ACM . tape 18 , no. 6 June 1975, p. 333-340 , doi : 10.1145 / 360825.360855 (English).
  2. grep: use Aho-Corasick algorithm to search multiple fixed words. In: grep: Git repository. June 2, 2016, accessed December 17, 2017 .
  3. ^ Adir Gabai: Integrating Aho-Corasick based algorithm for Compressed Traffic (ACCH) Inside Snort . 2015 (English, deepness-lab.org [PDF]).
  4. Breach Security Inc .: ModSecurity Reference Manual