PAT Tree

from Wikipedia, the free encyclopedia

A PAT tree is a data structure for quickly finding words in texts. The basic idea is to see the entire text as a character string, to break it down into siStrings and to insert their binary representation into a Patricia Trie .

A siString is a character string that starts at any point in the text and extends to the end of the text at most.

history

PAT Trees were described by Gaston H. Gonnet, Ricardo A. Baeza-Yates and Tim Snider in their paper New indices for text: PAT Trees and PAT arrays in 1992 . You use Patricia-Tries described by Donald R. Morrison. Gernot Gwehenberger independently (published in the same month Oct. 1968) described a search method and data structure identical to the PATRICIA trie. In 1984 Gernot Gwehenberger used this data structure to answer the question: To what extent does the information content of genetic information differ from the information content of other information, for example that of randomly generated information (information content I = 1). What is meant here is the information content according to Shannon. The comparative study was carried out by dividing three equally long text strings into siStrings and combining them into a PATRICIA tree. The respective information content I was then obtained from the mean number of search steps A according to the formula A = log2 (N) / I + K (I). This formula was derived for the case of a text sequence whose redundancy R (R = 1-I) results from the fact that bits 1 and 0, from which the text sequence is composed, have a different probability. N is the number of siStrings and K is a constant dependent on I (for which a formula is given). The 3 text strings of 4361 characters each were genetic information of the bacterium Escherichia coli (plasmid pBR322), as well as the first 4361 alphanumeric characters of the publication as well as information generated randomly for control purposes (I = 1). The practical evaluation resulted in I = 0.9923 for the genetic information, I = 0.834 for the text and I = 1.0014 for the control.

Search methods

PAT trees make it possible to efficiently answer various types of inquiries.

  • Prefix search
  • Proximity search
  • Area search
  • Frequency search
  • Longest repeat search
  • Search for regular expressions
  • Determining the information content (according to Shannon) of a text sequence

literature

  • Search trees: a versatile tool. In: OUTPUT. No. 8/1984, Goldach CH specialist press.

Web links