Suffix array

from Wikipedia, the free encyclopedia

In computer science, a suffix array is an array that specifies the suffixes of a character string in lexicographical order.

example

Consider the string = of length 11. Since the empty string is also a valid suffix, we add the end marker to the end of in order to be able to express the empty string or the end of . The string with the associated positions of its characters is thus: abracadabra$

i 1 2 3 4th 5 6th 7th 8th 9 10 11 12
a b r a c a d a b r a $

This string has twelve suffixes, which can also be described by their starting position iin :

suffix i
abracadabra$ 1
bracadabra$ 2
racadabra$ 3
acadabra$ 4th
cadabra$ 5
adabra$ 6th
dabra$ 7th
abra$ 8th
bra$ 9
ra$ 10
a$ 11
$ 12

The empty string is lexicographically smaller than any suffix of . Therefore, by convention , the end marker is also lexicographically smaller than any other character in the string. Thus, all suffixes can be sorted lexicographically: $

suffix i
$ 12
a$ 11
abra$ 8th
abracadabra$ 1
acadabra$ 4th
adabra$ 6th
bra$ 9
bracadabra$ 2
cadabra$ 5
dabra$ 7th
ra$ 10
racadabra$ 3

Represented as an array results {$, a$, abra$, …}.

If the original string is known, each suffix can be specified using the index of its first character to save space. The suffix array is an array of these indexes in lexicographical order. For the string abracadabra$, the suffix array is thus = : The suffix "a" or "a $" begins at the eleventh character, "abra" at the eighth character, and so on. {12,11,8,1,4,6,9,2,5,7,10,3}

The end marker $is useful when combining multiple strings. For example, in = it is immediately clear that the text consists of the two strings and . If only one string is considered, this suffix can be ignored: Since the empty string is always sorted lexicographically before all others, no information is lost in this case. abracadabra$mississippi$abracadabramississippi

Algorithms

There are a number of algorithms for constructing a suffix array from a given text of length n . The methods used for suffix sorting can be roughly divided into three classes: Iterative, recursive and induced sorting.

Iterative partial sorting

The simplest way is to use an efficient comparison-based sorting algorithm that requires at most comparisons. Since the comparison of two suffixes takes time, the complete runtime of this procedure is . Further developed algorithms improve this by using the results of partial comparisons and thus avoiding redundant comparisons.

To this end, only the first letter of each suffix is ​​considered and a preliminary suffix array is built from this, which has not yet sorted suffixes with the same first letter among one another. However, suffixes with different initial letters are already included in the correct order. In a second step, each group of suffixes with the same initial letter is sorted so that they are correctly sorted with regard to the first two initial letters. The third step sorts all suffix groups with two identical first letters so that they are correctly sorted with respect to the first four, the fourth step so that they are correctly sorted with respect to the first eight, and so on. After steps, each suffix is ​​completely correctly sorted and the suffix array is completely built up. Each partial step is possible in time , so that the total runtime results.

This class includes the classic suffix array algorithm from Manber and Myers as well as the algorithm by Larsson and Sadakane, which is much more efficient in practice.

Recursive algorithms

This class of algorithms has only been researched since 2003. The characters of the string are divided into two character strings x and y . The algorithm is then called recursively on x . With the suffix array of x calculated in this way , the suffix array of y can be efficiently calculated (“induced”). Since the division into x and y is known, the suffix array of can be read off from it.

By cleverly choosing x and y , most of these algorithms have a running time of . However, because recursion is expensive in practice, it is not always preferable.

Induced sorting

Here too, with an already calculated suffix array of a character string x, the suffix array of a character string y is efficiently calculated (induced). Instead of a recursive working method, the string can be run through several times in different directions. The characters are classified in x or y , partially sorted and further sorted in a later step on the basis of other partial results. The algorithms in this class are very diverse. Almost all of them have in common, however, that they are usually faster than recursive algorithms in practice, despite the often poor worst case runtime . They also generally require significantly less storage space than other algorithms.

The currently fastest algorithm in this class is the suffix-array-induced-sorting algorithm (SAIS for short) by Nong, Zhang and Chan. It only needs a runtime in . The SAIS algorithm also proves to be very fast in practice when various effects such as B. Cache misses are taken into account.

Applications

Once constructed, the suffix array can be used as an index of the text to efficiently perform subsequent operations. These operations include search queries and compression methods.

Exact search queries (string matching)

An exact search query consists of a search pattern that is to be searched for in a text . A text passage only counts as an exact occurrence of if every character in the text passage matches. In this way, these queries differ from the inexact queries in which a specified number of deviating characters is allowed. It is at the exact search queries between decision requests ( "Come in before?"), Number of requests ( "How often in front?") And enumeration requests ( "At what points is in front?"), Depending where decision requests, if necessary by using number requests and number inquiries can be answered with enumeration inquiries.

In order to determine the number of exact occurrences of , all suffixes that begin with must be searched for in the suffix array for . Since the suffix array is sorted, these suffixes are all directly behind each other and form a block. Therefore, it is sufficient to determine the lexicographically smallest and lexicographically largest suffix that begin with . With the aid of the binary search, these can be found in, where in the following stands for the length of and for the length of . The number of occurrences of is then equal to the number of suffixes in this block. This allows the total number of occurrences in to be determined. If, in addition, the output of the starting positions of all exact occurrences is required, the values ​​of the suffix array in the block must be returned instead. The runtime for this is , where stands for the number of occurrences of in .

Refined methods can achieve better runtimes with the help of additional data structures. When the so-called LCP array (longest common prefix) has been determined and a suitable RMQ data structure (range minimum query) has been constructed for the LCP array, decision and number inquiries in and enumeration inquiries can be answered in .

Construction of a suffix tree

The suffix array of a text is often used as an intermediate step in order to construct the associated suffix tree in linear time. The suffix tree can then also be used as an index to answer search queries.

compression

Various compression methods can be implemented efficiently using the suffix array. In this way, the factorization of the LZ77 compression can be implemented in linear time. In addition, the Burrows-Wheeler transformation of the text can be calculated from an existing suffix array with very little effort . To do this, for each suffix of the suffix array, the character that is exactly one position before the suffix in the text is determined and stored in an array. The resulting array is then identical to the Burrows-Wheeler transformation of the text and can be used, for example, for the bzip2 compression method.

history

Suffix arrays were originally developed by Gene Myers and Udi Manber in 1990 to reduce memory consumption compared to suffix trees . After a few years no significant findings emerged, suffix arrays have been a popular research topic since around 2000. Since then, a variety of design algorithms have been developed that have numerous desirable properties.

Algorithms have existed since 1999 that are faster than the existing linear time algorithms in most applications, but in the worst case require a time in the range to . The first guaranteed linear time algorithms (i.e. those with worst case runtime ) have only been known since 2003. It has been known since 2004 that suffix arrays can solve any problem with the same time complexity as suffix trees . Since then, at the latest, suffix arrays have been the method of choice for almost all suffix and string sorting tasks.

From 2005, the efficient storage of suffix arrays was also considered in addition to the design. In addition to pure suffix arrays, compressed suffix arrays can now also be efficiently constructed and used. They are also useful for compressed full-text indexes based on the Burrows-Wheeler transformation .

literature

  • Udi Manber, Gene Myers: Suffix arrays: a new method for on-line string searches . In: SIAM Journal on Computing , Volume 22, Issue 5, October 1993, pp. 935-948.
  • Pang Ko, Srinivas Aluru: Space efficient linear time construction of suffix arrays . In: Combinatorial Pattern Matching (CPM 03) . LNCS 2676, Springer, 2003, pp. 203-210.
  • Juha Kärkkäinen, Peter Sanders: Simple linear work suffix array construction . (PDF; 193 kB) In: Proc. 30th International Colloquium on Automata, Languages ​​and Programming (ICALP '03) . LNCS 2719, Springer, 2003, pp. 943-955.
  • Enno Ohlebusch: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction . Oldenbusch, Bremen 2013, uni-ulm.de
  • Klaus-Bernd Schürmann, Jens Stoye: An incomplex algorithm for fast suffix array construction . (PDF; 204 kB) In: Proceedings of ALENEX , 2005.

Web links

Commons : suffix array  - collection of images, videos and audio files

Individual evidence

  1. a b c Simon J. Puglisi, WF Smyth, Andrew H. Turpin: A Taxonomy of Suffix Array Construction Algorithms . In: ACM Computing Surveys (CSUR) 39.2 (2007)
  2. a b U. Manber, GW Myers: Suffix arrays: A new method for on-line string searches . In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms (1990)
  3. a b J.N. Larsson, K. Sadakane: Faster suffix sorting . In tech rep. LU-CS-TR: 99-214 , Dep. of Computer Science, Lund University, Sweden (1999)
  4. ^ Nong Ge, Sen Zhang, Wai Hong Chan: T wo Efficient Algorithms for Linear Time Suffix Array Construction . In: IEEE Transactions on Computers 60 , no.10 (October 2011), pp. 1471-1484.
  5. Nataliya Timoshevskaya, Wu-chun Feng: SAIS-OPT: On the characterization and optimization of the SA-IS algorithm for suffix array construction . In: 2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) , 2014.
  6. Enno Ohlebusch: Bioinformatics Algorithms. Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen 2013, ISBN 978-3-00-041316-2 , p. 116.
  7. Enno Ohlebusch: Bioinformatics Algorithms. Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen 2013, ISBN 978-3-00-041316-2 , pp. 120-125.
  8. Enno Ohlebusch: Bioinformatics Algorithms. Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen 2013, ISBN 978-3-00-041316-2 , pp. 117-118.
  9. Enno Ohlebusch: Bioinformatics Algorithms. Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen 2013, ISBN 978-3-00-041316-2 , pp. 113-114.
  10. Enno Ohlebusch: Bioinformatics Algorithms. Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen 2013, ISBN 978-3-00-041316-2 , p. 130.
  11. Juha Kärkkäinen: Fast BWT in small space by blockwise suffix sorting. In: Theoretical Computer Science. Volume 387, No. 3, 2007, p. 251 ( PDF; 227KB )
  12. S. Burkhardt, J. Kärkkäinen: Fast lightweight suffix array construction and checking . In: Proceedings of the 14th Annual Symposium CPM 2003
  13. P. Ko, S. Aluru: Space efficient linear time construction of suffix arrays . In: Proceedings of the 14th Annual Symposium CPM 2003
  14. Juha Kärkkäinen, Peter Sanders: Simple linear work suffix array construction . (PDF; 193 kB) In_ Proc. 30th International Colloquium on Automata, Languages ​​and Programming (ICALP '03) . LNCS 2719, Springer, 2003, pp. 943-955
  15. MI Abouelhoda, S. Kurtz, E. Ohlebusch: Replacing suffix trees with suffix arrays . In J. Disc. Algor. 2 , 1, 2004, pp. 53-86
  16. ^ R. Grossi, J. Vitter: Compressed suffix arrays and suffix trees with applications to text indexing and string matching . In SIAM J. Comput. 35.2, 2005, pp. 378-407
  17. ^ G. Navarro, V. Mäkinen: Compressed full-text indexes . In_ ACM Comput. Serv. , 39, 1.2, 2007