Suffix tree

from Wikipedia, the free encyclopedia
Suffix tree for abbabbab , leaves are labeled with the start indices (1-based) of the corresponding suffixes, $ marks the end of a suffix

A suffix tree is a versatile data structure that enables efficient solutions to numerous string processing problems . A suffix tree stores all suffixes (endings) of a character string. It can be set up in linear time with linear memory consumption and, once set up, enables many operations to be carried out quickly, such as searching for words or sentences in longer texts.

definition

A suffix tree T for a string S with m symbols is a directed tree with m leaves . Each edge is labeled with a substring of S. Every inner node of T has at least two children whose edge labels never begin with the same symbol. For each leaf i in T , the labels of the edges on the path from the root to i, linked together, result in the suffix of S , which begins at index i . Thus, T contains all suffixes of S , whereby sub-strings that occur multiple times are only contained once in T (see figure).

construction

The tree initially consists of a root node and a list of (pointers to) all of the continuable locations in the tree, that is to say initially of just a pointer to the root node. The tree is built for a word that is gradually extended. An edge is drawn to a new node for all nodes from the list of positions that can be continued. This edge is labeled with the last letter added to the word. This new node is then placed on a list of continuable locations for the next iteration. Finally, the start node is always included in the new list (since the empty word is always a valid suffix). If an edge already exists for a continuable position, the labeling of which corresponds to the letter added last, no node is added and the existing target node is added to the list of continuable positions instead.

By adding the start node in each step, the list of places that can be continued after n  iterations is also n +1 elements long, which results in quadratic runtime. There is a linear running time algorithm based on the suffix array .

Applications

Thus, a suffix tree enables the efficient solution of numerous string processing problems . The construction of the tree is possible with a linearly increasing runtime and a linearly increasing storage space requirement. For example, in the area of ​​the exact string comparison in a text of length m, after preprocessing in O ( m ), a suffix tree enables the search for k patterns of length n with a runtime complexity of only O ( k + n ). A suffix tree also enables efficient procedures in the area of ​​soft string comparison, for example when matching with wildcards or k-mismatch (see Gusfield 1999: 199 ff.) Or in the form of accelerating the determination of the Levenshtein distance using hybrid dynamic programming (see Gusfield 1999: 279 ff.).

For a detailed description and more than 20 other uses of suffix trees see Gusfield (1999: 89ff.).

history

  • Morrison (1968): Patricia-Trie .
  • Weiner (1973): First algorithm for the construction of suffix trees with linearly increasing runtime.
  • McCreight (1976): Further development with higher storage efficiency.
  • Ukkonen (1995): Conceptually new, simpler algorithm with runtime and storage space complexity O (n), also enables an on-line construction of the tree.
  • Giegerich & Kurtz (1995): Accelerated algorithm using functional language and lazy evaluation: construction while searching, only relevant parts of the tree are created if they do not yet exist.

See also

literature

  • Dan Gusfield: Algorithms on Strings, Sequences and Trees . 1999, ISBN 978-0-521-58519-4 (English, first edition: 1997).
  • Hans-Joachim Böckenhauer and Dirk Bongartz: Algorithmic Basics of Bioinformatics . 2003, ISBN 978-3-519-00398-4 .

Web links

Commons : Suffix tree  - collection of images, videos and audio files