Suffix array induced sorting
Suffix-Array-Induced-Sorting (SAIS for short) is a method in computer science with which suffix arrays for any text can be constructed in linear time. The idea is to pre-sort the suffixes determined by recursion in order to then be able to classify the remaining suffixes by running through the text several times. As a result, this method sorts suffixes particularly efficiently.
Procedure of the algorithm
In the following, the algorithm is applied to a text that constructs a suffix array . With the suffix of ab index is given.
- Classify all suffixes in the text with (smaller) or (larger) by traversing the text from right to left. A character or suffix is classified as if the following suffix is lexicographically larger, otherwise as .
- Mark a suffix of the type with if its predecessor is lexicographically larger
- Divide into buckets. Buckets represent intervals in which all suffixes begin with the same respective character.
- Divide the buckets into and into buckets . Within an upper bucket, the -Buckets are in front of the -Buckets.
- Sort the suffixes lexicographically and write them in their respective buckets.
- Scan from left to right. If there is a suffix instead and is of the type , then write in the next free position in the type bucket for the letter .
- Scan from right to left. If there is a suffix instead and is of the type , then write in the next free position in the type bucket for the letter from right to left.
Step 5 can be specified here:
- Sort the -Substrings and then assign super characters to them. Turn it into text .
- Compute suffix array for recursive.
- From then follows the sorting of the suffixes, inverse transformation of the indices from to .
example
The example text is used here , the dollar sign at the end symbolizes the end of the character string. The associated classification results in the following assignments:
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
After classification has been carried out, they are now divided into buckets. It should be noted that the buckets are lexicographically ordered and their size depends on the number of characters that appear.
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s |
It can be seen that the bucket for the dollar sign is quite small because this character appears only once in the text. The bucket for the character, however, is comparatively large because it occurs six times in .
There is a further subdivision into and buckets, with the buckets before the buckets.
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s | ||||||||||
Bucket type | S. | L. | S. | L. | L. | L. |
The suffixes are now classified in the respective buckets. For each character, all -suffixes are first sorted lexicographically and then written to the -Bucket. This is done in a sub- step by constructing from the given -Substrings . To simplify matters, the substrings are provided with super characters. is then of the form
Wherein , , and . The super characters to stand for the respective suffixes.
In order to be able to generate, the super characters are assigned indices. Then it is sorted lexicographically.
index | 1 | 2 | 3 | 4th |
---|---|---|---|---|
T ' | D. | B. | C. | A. |
Place in T | 4th | 7th | 11 | 15th |
is then of the form
Wherein , , and .
With this step, part of the complete suffix array is already pre-sorted, this can now be inserted into the buckets:
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s | ||||||||||
Bucket type | S. | L. | S. | L. | L. | L. | |||||||||
15th | 7th | 11 | 4th |
The buckets are now passed through from left to right. If a suffix is found for a corresponding position whose predecessor suffix is of the type , this is written to the next free position in the bucket of the respective character.
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s | ||||||||||
Bucket type | S. | L. | S. | L. | L. | L. | |||||||||
15th | 7th | 11 | 4th | ||||||||||||
14th |
This process is repeated until the end of has been reached. After the pass contains the following suffixes:
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s | ||||||||||
Bucket type | S. | L. | S. | L. | L. | L. | |||||||||
15th | 7th | 11 | 4th | ||||||||||||
14th | 3 | 6th | 10 | ||||||||||||
2 | 13 | 5 | |||||||||||||
12 |
Now the run is from right to left:
index | 1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | 9 | 10 | 11 | 12 | 13 | 14th | 15th |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
T | i | m | m | i | s | s | i | i | s | s | i | p | p | i | $ |
Type | S. | L. | L. | S * | L. | L. | S * | S. | L. | L. | S * | L. | L. | L. | S * |
Buckets | $ | i | m | p | s | ||||||||||
Bucket type | S. | L. | S. | L. | L. | L. | |||||||||
15th | 7th | 11 | 4th | ||||||||||||
14th | 3 | 6th | 10 | ||||||||||||
2 | 13 | 5 | |||||||||||||
12 | 9 | ||||||||||||||
8th | |||||||||||||||
4th | |||||||||||||||
11 | |||||||||||||||
1 |
Finally, the sorted suffixes are put together from left to right. The result is the sorted suffix array:
Example implementation (pseudocode)
1 sais(T,A)
2 for i = n to 1 do
3 if (T[i] >lex T[i+1])
4 typ[i] <- L
5 if (typ[i+1] = S) typ[i+1] = S*
6 else
7 typ[i] <- S
8
9 begin <- {}
10 for j = 1 to n do
11 if (typ[j] = S*)
12 if (begin = {})
13 begin <- j
14 else
15 end <- p
16 T’[q] <- CharacterFor(begin,end)
17 q <- q + 1
18 begin <- end
19
20 If (everyCharacterInT’IsUnique(T’))
21 A <- countingSort(T’)
22 return A
23 else
24 A <- sais(T’,A)
25
26 for k = 1 to n do
27 if (A[k] != {})
28 if (typ[A[k]-1] = L)
29 A <- writeToLBucketForCharacter(T[A[k]-1])
30
31 for l = n to 1 do
32 if (A[l] != {})
33 if (typ[A[l]-1] = S)
34 A <- writeToSBucketForCharacter(T[A[l]-1])
35
36 return A
Further implementations can be found under. Including implementations in Java, as well as C.
running time
The suffix array induced sorting algorithm solves the problem of suffix sorting in a runtime of . The example implementation shows three loops of length that iterate over the text . The two indicated functions writeToLBucketForCharacter and writeToSBucketForCharacter jump within the array to the -Bucket or -Bucket for the specified character. In a naive implementation, it will be necessary to check a maximum of n times within the bucket whether there is a free space in the bucket in order to find a free space.
The sorting of the suffixes is interesting. Here the problem of sorting is limited to the suffixes by applying the SAIS algorithm recursively to them. The transferred text T 'is at most long since, by definition, a can only appear in every second position in the text. This results in the following combined runtime:
Overall, a running time of is needed.
use
The suffix-array induced sorting algorithm is used to create a suffix tree in time. It only forms a partial step between the text and the resulting suffix tree.
See also
literature
- Ge Nong, Sen Zhang, Wai Hong Chan: Two Efficient Algorithms for Linear Time Suffix Array Construction. IEEE Trans. Computers 60 (10): 1471-1484 (2011)
Individual evidence
- ↑ Johannes Fischer: Lecture notes "Text Indexing and Information Retrieval" . Winter semester 2014/2015. Retrieved January 20, 2015.
- ↑ Yuta Mori: Sai's example implementations of Sais. Retrieved January 19, 2015 .