Suffix array induced sorting

from Wikipedia, the free encyclopedia
Example for sorting the suffix arrays for the text immissiissippi $

Suffix-Array-Induced-Sorting (SAIS for short) is a method in computer science with which suffix arrays for any text can be constructed in linear time. The idea is to pre-sort the suffixes determined by recursion in order to then be able to classify the remaining suffixes by running through the text several times. As a result, this method sorts suffixes particularly efficiently.

Procedure of the algorithm

In the following, the algorithm is applied to a text that constructs a suffix array . With the suffix of ab index is given.

  1. Classify all suffixes in the text with (smaller) or (larger) by traversing the text from right to left. A character or suffix is classified as if the following suffix is ​​lexicographically larger, otherwise as .
  2. Mark a suffix of the type with if its predecessor is lexicographically larger
  3. Divide into buckets. Buckets represent intervals in which all suffixes begin with the same respective character.
  4. Divide the buckets into and into buckets . Within an upper bucket, the -Buckets are in front of the -Buckets.
  5. Sort the suffixes lexicographically and write them in their respective buckets.
  6. Scan from left to right. If there is a suffix instead and is of the type , then write in the next free position in the type bucket for the letter .
  7. Scan from right to left. If there is a suffix instead and is of the type , then write in the next free position in the type bucket for the letter from right to left.

Step 5 can be specified here:

  1. Sort the -Substrings and then assign super characters to them. Turn it into text .
  2. Compute suffix array for recursive.
  3. From then follows the sorting of the suffixes, inverse transformation of the indices from to .

example

The example text is used here , the dollar sign at the end symbolizes the end of the character string. The associated classification results in the following assignments:

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *

After classification has been carried out, they are now divided into buckets. It should be noted that the buckets are lexicographically ordered and their size depends on the number of characters that appear.

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s

It can be seen that the bucket for the dollar sign is quite small because this character appears only once in the text. The bucket for the character, however, is comparatively large because it occurs six times in .

There is a further subdivision into and buckets, with the buckets before the buckets.

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s
Bucket type S. L. S. L. L. L.

The suffixes are now classified in the respective buckets. For each character, all -suffixes are first sorted lexicographically and then written to the -Bucket. This is done in a sub- step by constructing from the given -Substrings . To simplify matters, the substrings are provided with super characters. is then of the form

Wherein , , and . The super characters to stand for the respective suffixes.

In order to be able to generate, the super characters are assigned indices. Then it is sorted lexicographically.

index 1 2 3 4th
T ' D. B. C. A.
Place in T 4th 7th 11 15th

is then of the form

Wherein , , and .

With this step, part of the complete suffix array is already pre-sorted, this can now be inserted into the buckets:

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s
Bucket type S. L. S. L. L. L.
15th 7th 11 4th

The buckets are now passed through from left to right. If a suffix is ​​found for a corresponding position whose predecessor suffix is ​​of the type , this is written to the next free position in the bucket of the respective character.

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s
Bucket type S. L. S. L. L. L.
15th 7th 11 4th
14th

This process is repeated until the end of has been reached. After the pass contains the following suffixes:

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s
Bucket type S. L. S. L. L. L.
15th 7th 11 4th
14th 3 6th 10
2 13 5
12

Now the run is from right to left:

index 1 2 3 4th 5 6th 7th 8th 9 10 11 12 13 14th 15th
T i m m i s s i i s s i p p i $
Type S. L. L. S * L. L. S * S. L. L. S * L. L. L. S *
Buckets $ i m p s
Bucket type S. L. S. L. L. L.
15th 7th 11 4th
14th 3 6th 10
2 13 5
12 9
8th
4th
11
1

Finally, the sorted suffixes are put together from left to right. The result is the sorted suffix array:

Example implementation (pseudocode)

 1 sais(T,A)
 2 for i = n to 1 do
 3     if (T[i] >lex T[i+1])
 4         typ[i] <- L
 5         if (typ[i+1] = S) typ[i+1] = S*
 6     else
 7         typ[i] <- S
 8 
 9 begin <- {}
10 for j = 1 to n do
11        if (typ[j] = S*)
12          if (begin = {})
13            begin <- j
14 	 else
15            end <- p
16            T[q] <- CharacterFor(begin,end)
17            q <- q + 1
18            begin <- end
19 
20 If (everyCharacterInTIsUnique(T))
21 	A <- countingSort(T)
22         return A
23 else
24         A <- sais(T,A)
25 
26 for k = 1 to n do
27 	if (A[k] != {})
28 	  if (typ[A[k]-1] = L)
29 	    A <- writeToLBucketForCharacter(T[A[k]-1])
30 
31 for l = n to 1 do
32 	if (A[l] != {})
33 	  if (typ[A[l]-1] = S)
34 	    A <- writeToSBucketForCharacter(T[A[l]-1])
35 
36 return A

Further implementations can be found under. Including implementations in Java, as well as C.

running time

The suffix array induced sorting algorithm solves the problem of suffix sorting in a runtime of . The example implementation shows three loops of length that iterate over the text . The two indicated functions writeToLBucketForCharacter and writeToSBucketForCharacter jump within the array to the -Bucket or -Bucket for the specified character. In a naive implementation, it will be necessary to check a maximum of n times within the bucket whether there is a free space in the bucket in order to find a free space.

The sorting of the suffixes is interesting. Here the problem of sorting is limited to the suffixes by applying the SAIS algorithm recursively to them. The transferred text T 'is at most long since, by definition, a can only appear in every second position in the text. This results in the following combined runtime:

Overall, a running time of is needed.

use

The suffix-array induced sorting algorithm is used to create a suffix tree in time. It only forms a partial step between the text and the resulting suffix tree.

See also

literature

  • Ge Nong, Sen Zhang, Wai Hong Chan: Two Efficient Algorithms for Linear Time Suffix Array Construction. IEEE Trans. Computers 60 (10): 1471-1484 (2011)

Individual evidence

  1. Johannes Fischer: Lecture notes "Text Indexing and Information Retrieval" . Winter semester 2014/2015. Retrieved January 20, 2015.
  2. Yuta Mori: Sai's example implementations of Sais. Retrieved January 19, 2015 .