Wavelet tree

from Wikipedia, the free encyclopedia
A wavelet tree from the string "abracadabra". Each node divides the string into two parts depending on the alphabet. A bit vector assigns each character from the sequence to its area. It should be noted that the data structure does not save the tree, but only the topology and the bit vectors. The strings in the nodes are for illustration only.

In computer science, a wavelet tree is a compact data structure for storing character strings in compressed form. He extends the methods and from a bit vector to any alphabet .

The data structure was described for the first time as the main component for compressed full-text indexing and is considered to be a minor generalization of a data structure from algorithmic geometry . The wavelet tree can be described recursively . Each node distributes the string to its two successors. The remaining alphabet is divided among the child nodes. A bit vector stores the allocated partition for each character.

The origin of the name of the trees lies in the wavelet transformation , used to reduce image data and for the approximate evaluation of expressions in relational algebra.

construction

A wavelet tree of the sequence above the alphabet can be described recursively using a partial alphabet . A wavelet tree above the alphabet is a binary balanced tree with leaves.

  1. If so, the tree consists of only one leaf with the label .
  2. Otherwise the tree has a root node with a bitmap , which is defined as follows:
Let the partial sequence consisting of the symbols and the partial sequence consisting of the symbols . Then the left child of is a wavelet tree from above the alphabet and the right child of a wavelet tree from above the alphabet .

properties

The tree described has a height of , leaves and internal nodes. It stores bits at each level and at most in the lowest level. The tree can thus be represented with a maximum of bits. If you look closely, this representation requires additional bits for the pointers.

Let a character string with length from the alphabet be represented as a wavelet tree with bits.

Operations

The wavelet tree supports the operations , and in time, if a balanced tree was constructed.

Access

: Direct access to the i'th element in the string.

In order to calculate the character at the position , the bit vector is considered. If the value is at this position , then and we continue the procedure recursively on the left child node, otherwise it applies and the algorithm processes the right child. To do this, the new position of must be determined in the bit vector . The new position is the number of zeros in the vector up to position i, if applies. If the right child is processed recursively, the occurrences of the ones must be added up. The function or a bit vector is used for this.

The rank function on bit vectors can be evaluated in constant time with the help of additional bits.

rank

: Number of characters up to position i in the string.

The rank is determined in the same way as for the access operation. After executing the Access algorithm, the rank results from the number of occurrences of up to position i in the leaf node.

Select

: Position of the i-th occurrence of the character q in the character string.

To determine this position the algorithm starts with the leaf that represents q. The algorithm now runs through the nodes recursively to the root: If the node is a left child, the new position in the parent node results from the position of the ith in the associated bit vector. If the child is a right-hand successor, the new position results from the position of the i-th . This select operation on a bit vector can be carried out in constant time with additional .

compression

The space consumption of can be reduced by removing redundancies using different methods to bits with the same operating times or bits and constant running times for rank and select.

application

This data structure is used in a wide variety of applications. Wavelet trees are used in applications to represent three different classes.

Sequence of values

The wavelet tree represents a string. The operations used are the three basic operations mentioned on the tree. This representation is the most common.

Sorting

The tree describes an orderly representation of the outgoing string . The leaves of the tree represent the sorted sequence . This results in two additional operations. returns the position of the character in the sorted sequence. Conversely, the ascent from the r'th leaf to the root results in the position of the element i . This representation was used by the inventor of Wavelet Trees.

Grid of points

The wavelet tree represents a number of points.

Extensions

Some extensions of the trees can be found in the literature. To minimize the height of wavelet trees, t'nary instead of binary nodes are used. Thus, the node degree increases to t children and the depth of the tree decreases. Operations such as inserting and deleting characters at any position in the character string increase the dynamics of the wavelet tree and enable dynamic FM indexes to be supported .

Web links

  • Wavelet trees . Blog describing the construction and queries of a wavelet tree with examples.

Individual evidence

  1. a b R. Grossi, A. Gupta, and JS Vitter, High-order entropy-compressed text indexes (PDF; 292 kB), Proceedings of the 14th Annual SIAM / ACM Symposium on Discrete Algorithms (SODA) , January 2003, 841 -850
  2. B. Chazelle, A functional approach to data structures and its use in multidimensional searching , SIAM Journal on Computing , Volume 17, Issue 3, June 1988, Pages 427-462
  3. a b G. Navarro, Wavelet Trees for All (PDF; 397 kB), Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM) , 2012
  4. ^ G. Jacobson, Space-efficient static trees and graphs (PDF; 381 kB), International Journal of Foundations of Computer Science (IJFCS) , 1989, Pages 549-554
  5. ^ D. Clark, Compact Pat Tree (PDF; 6.7 MB), University of Waterloo, Canada , 1996
  6. ^ I. Munro, Tables, University of Waterloo, Canada , 1996, Pages 37-42
  7. V. Mäkinen, G. Navarro, Position-Restricted Substring Searching , Springer Heidelberg, Technical Faculty Bielefeld University , 2006, Pages 703-714
  8. V.Mäkinen, G. Navarro, rank and select revisited and extended , Springer Heidelberg, University of Helsinki , 2007, Pages 332-347
  9. A. Golynski, R. Grossi, A. Gupta, R. Raman, On the Size of Succinct Indices , Springer Heidelberg , 2007, Pages 371-382
  10. a b P. Ferragina, R. Giancarlo, G. Manzini, The myriad virtues of Wavelet Trees (PDF; 529 kB), Information and Computation , Volume 207, Issue 8, August 2009, Pages 849-866
  11. P. Ferragina, G. Manzini, V. Mäkinen, G. Navarro, An Alphabet-Friendly FM-Index , Springer Heidelberg , 2004, Pages 150-160
  12. P. Ferragina, G. Manzini, V. Mäkinen, G. Navarro, Compressed representations of sequences and full-text indexes , Association for Computing Machinery (ACM) , 2007, Article 20
  13. H.-L. Chan, W.-K. Hon, T.-W. Lam, and K. Sadakane, Compressed Indexes for dynamic text collections, ACM Transactions on Algorithms , 3 (2), 2007