LCP array

from Wikipedia, the free encyclopedia

The LCP array is a data structure from computer science , which is mostly used in combination with the suffix array . The designation “LCP” is an abbreviation for “ longest common prefix ” (German longest common prefix ). The array itself contains the length of the longest common prefix of two lexicographically consecutive suffixes .

There are numerous applications for the LCP array in the field of text search and indexing, such as the construction of the suffix tree or an efficient search for all occurrences of a search pattern in a text. The required storage space of the LCP array is linear compared to the text size and there are algorithms that construct the LCP array in linear time with the help of the suffix array. It was first used in a paper by Manber and Myers (1993) in which it was referred to as the Hgt array.

definition

Be some text of length and be the suffix array above the text . Also denote the suffix and the length of the longest common prefix of the two strings and .

The LCP array is an array of size , with the individual fields defined as follows:

The suffix array contains a lexicographical sorting of all suffixes of . To put it more informally, an entry in the LCP array always refers to two lexicographically consecutive suffixes of and describes the length of the longest common prefix of the suffixes considered. The value of is undefined because the lexicographically smallest suffix of is and has no predecessor.

example

Look at the text .

i 1 2 3 4th 5 6th 7th 8th 9 10 11 12
T [i] m i s s i s s i p p i $

The suffix array of represents the sorting of the suffixes of the text, whereby the index of the first letter of the respective suffix is ​​stored in the array. The suffix sorting for the text looks like this:

i 1 2 3 4th 5 6th 7th 8th 9 10 11 12
A [i] 12 11 8th 5 2 1 10 9 7th 4th 6th 3
1 $ i i i i m p p s s s s
2 $ p s s i i p i i s s
3 p s s s $ i p s i i
4th i i i s $ p s p s
5 $ p s i i i p s
6th p s s $ p i i
7th i i s p $ p
8th $ p i i p
9 p p $ i
10 i p $
11 $ i
12 $

The LCP array contains the length of the longest common prefix of two consecutive suffixes. It can be constructed by comparing the suffixes in the suffix order character by character. For example, to calculate the value, the suffixes beginning with and are compared with one another: The longest common prefix of and is and therefore has a length of 2. Correspondingly, is . The remaining values ​​of the LCP array can be calculated in the same way. The value of remains undefined because there is no lexicographically available suffix . The LCP array for the text looks like this:

i 1 2 3 4th 5 6th 7th 8th 9 10 11 12
Hi] 0 1 1 4th 0 0 1 0 2 1 3

calculation

The easiest way to calculate the LCP array would be (as in the example above) to compare the lexicographically consecutive suffixes character by character with the help of the suffix array, in order to calculate the length of the longest common prefix. However, in the worst case, this method results in a running time of . If a text contains the same characters, for example , comparisons would have to be made overall .

A more efficient approach is based on the basic idea of ​​calculating the LCP entries in text order. Suppose and are consecutive values ​​in the suffix array and and have a common prefix of characters. The suffixes and then have common characters. However, and do not have to be next to each other in the suffix array; there may well be suffixes that are lexicographically between and . Suppose is the value before the value is in the suffix array. Then and because of the lexicographical sorting of the suffixes must have at least common characters (because it lies between and in the suffix array ), i.e. it would be sufficient to compare the two suffixes starting with the -th character.

For this approach, the inverse suffix array is needed to find out where a certain value occurs in. This is the inverse permutation of , that is , indicates where in the suffix array the value is located.

The following algorithm results:

 1 LCP-Array(T, A, n)
 2    for (i=1 to n)  {
 3       Ainv[A[i]] = i;
 4    }
 5    H[1] = 0;
 6    h = 0;
 7    for (i=1 to n) {
 8       if (Ainv[i]  1) {
 9          j = A[Ainv[i] - 1]
10          while T[i+h] = T[j+h]
11             h = h + 1
12          H[Ainv[i]] = h
13          h = max(0, h - 1)
14       }
15    }

The running time is linear because the maximum it can assume is the value . Since each step only decrements by 1, the inner while loop is executed at most times. This results in a total running time of .

The approach presented above comes from Kasai et al. (2001) and is the first published algorithm that calculates the LCP array in linear time. Manzini (2004) developed an improved version that does not require any additional memory besides the actual text, the suffix array and the LCP array. Another improvement is the φ algorithm by Kärkkäinen, Manzini and Puglisi: While the original algorithm caused cache misses through non- sequential access to the arrays for correspondingly long texts (namely in lines 3, 9, 10 and 12), the φ algorithm manages with only cache misses.

The algorithms mentioned above use an existing suffix array to calculate the LCP array. There are algorithms that compute the LCP array at the same time as the suffix array. The currently fastest linear time algorithm comes from Fischer (2011).

Gog & Ohlebusch (2011) published two algorithms that have a quadratic running time in the worst case, but are faster in practice than the linear time algorithms mentioned above.

Accelerated text search with the help of the LCP array

With the suffix array, the occurrence of a pattern of length in a text of length can be determined with the help of binary search . Since the suffixes in the suffix array are sorted lexicographically, comparisons are sufficient to find the lexicographically smallest (or largest) suffix that contains. For each comparison, up to characters have to be compared, which means that this method has a runtime of .

With the help of the LCP array, the runtime can be improved by preventing the pattern from having to be read again from the beginning with each step of the binary search .

At each step of the binary search there is a left interval limit , a right interval limit and a middle element . It is compared with the lexicographical suffix (i.e. with ). If both strings match, the binary search is finished, otherwise the search must be continued in either the left or right half of the interval. We look at the case that lexicographically smaller than is - the other case is analogous. Let be the length of the longest common prefix of the two strings. This means that the -th character of is lexicographically smaller than that of .

The new interval has the limits and and the middle element . Let be the length of the longest common prefix between the old and the new middle element. A case distinction follows:

  • : The -th character of the suffixes and is the same. Therefore, it must also be lexicographically smaller than and the binary search can be continued in the left half of the interval without further comparisons.
  • : Because of , the suffix is lexicographically smaller than the suffix . This means that the -th character of the suffix must be smaller than the -th character of the suffix . Since and have a common prefix of at least characters, it follows here that lexicographically is smaller than P and the binary search is continued in the right half.
  • : and have a common prefix of at least characters. Both strings must be compared here, whereby the comparison must only begin with the -th character.

With the method described, it is never necessary to read characters from again in the binary search . The method comes from Manber and Myers. Since the LCP array only contains the values ​​for lexicographically consecutive suffixes, the following shows how any value values ​​can be calculated efficiently.

In general, the longest common prefix of two suffixes can be found using a Range Minimum Query over the LCP array. For two suffixes , consider all suffixes that are lexicographically in between: If two successive suffixes have a longest common prefix of length , and because of the lexicographical sorting, cannot have a longer common prefix. The value therefore corresponds exactly to the minimum above the LCP entries . This minimum can be found in constant time using suitable methods for range minimum queries.

literature

  • Enno Ohlebusch Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction . Oldenbusch Verlag, 2013, Chapter 4.

Individual evidence

  1. a b Udi Manber, Gene Myers: Suffix Arrays: A New Method for On-Line String Searches . In: SIAM Journal on Computing . 22, No. 5, 1993, p. 935. doi : 10.1137 / 0222058 .
  2. Kasai, T. et al .: Linear-Time Longest-Common-Prefix Computation in Suffix . In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, 2001.
  3. ^ Manzini, Giovanni: Two Space Saving Tricks for Linear Time LCP Array Computation . In: Lecture Notes in Computer Science, Volume 3111, Page 372, 2004.
  4. Kärkkäinen, Juha; Manzini, Giovanni; Puglisi, Simon J .: Permuted Longest-Common-Prefix Array . In: Lecture Notes in Computer Science, Volume 5577, Page 181, 2009.
  5. ^ Fischer, Johannes: Inducing the LCP-Array . In: Lecture Notes in Computer Science, Volume 6844, pp. 374–385, 2011.
  6. Gog, Simon; Ohlebusch, Enno: Fast and Lightweight LCP-Array Construction Algorithms . In: Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX, pp. 25-34, 2011.
  7. ^ Fischer, J. and V. Heun: Theoretical and practical improvements on the RMQ problem, with applications to LCA and LCE . In: Combinatorial Pattern Matching . 2006, pp. 36-48. doi : 10.1007 / 11780441_5 .
  8. ^ Fischer, J. and V. Heun: Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays . In: SIAM J. Comput. 40 (apr) . 2011, pp. 465-492. doi : 10.1137 / 090779759 .