Prediction by Partial Matching

From Wikipedia, the free encyclopedia

Prediction by partial matching (PPM) is a family of adaptive statistical data compression algorithms based on context modeling and prediction. PPM models use a set of previous symbols in the symbol stream to predict the next symbol in the stream.

Predictions are usually reduced to rankings of the symbols. The number of preceding symbols, n, determines the order of the PPM model, which is denoted PPM(n). Unbounded variants with no restriction on the length of the context also exist and are denoted PPM*. If no prediction can be made on the basis of all n context symbols, a prediction is attempted on the basis of n − 1 symbols. This process is repeated until a match is found or no symbols remain in the context. At that point a fixed prediction is made. This procedure is the inverse of the one followed by dynamic Markov compression, which builds up from a zeroth-order model.
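The fallback from the longest matching context down to order 0, ending in a fixed prediction, can be sketched as follows. This is an illustrative sketch only, not an implementation from the article; the class and method names, and the choice of a uniform distribution as the fixed order-(−1) prediction, are my own assumptions.

```python
# Minimal sketch of PPM's context-fallback prediction (hypothetical names;
# a uniform fallback stands in for the "fixed prediction" mentioned above).
from collections import defaultdict


class PPMModel:
    def __init__(self, order, alphabet):
        self.order = order
        self.alphabet = list(alphabet)
        # counts[context][symbol] = how often `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # Record the symbol under every context of length 0..order.
        for k in range(min(self.order, len(history)) + 1):
            ctx = tuple(history[len(history) - k:])
            self.counts[ctx][symbol] += 1

    def predict(self, history):
        # Try the longest available context first; on a miss,
        # "escape" to the next shorter context.
        for k in range(min(self.order, len(history)), -1, -1):
            ctx = tuple(history[len(history) - k:])
            seen = self.counts.get(ctx)
            if seen:
                total = sum(seen.values())
                return {s: c / total for s, c in seen.items()}
        # No context matched at any order: fixed (here uniform) prediction.
        p = 1 / len(self.alphabet)
        return {s: p for s in self.alphabet}


m = PPMModel(2, "ab")
text = "abab"
for i, s in enumerate(text):
    m.update(text[:i], s)
print(m.predict("ab"))  # → {'a': 1.0}
```

After training on "abab", the order-2 context ('a', 'b') has only ever been followed by 'a', so the model predicts it with probability 1; an unseen context falls through to shorter contexts before the uniform fallback is used.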

Much of the work in optimizing a PPM model consists of handling inputs that have not yet appeared in the input stream. The obvious way to deal with them is to create a "never-seen" symbol that triggers the escape sequence. But what probability should be assigned to a symbol that has never occurred? This is called the zero-frequency problem. One approach assigns the "never-seen" symbol a fixed pseudocount of one. A variant called PPMd increments the pseudocount each time the "never-seen" symbol occurs. (In other words, PPMd estimates the probability of a new symbol as the ratio of the number of unique symbols to the total number of symbols observed.)
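The PPMd-style estimate described above can be expressed in a few lines. This is a hedged sketch of the ratio as stated in the text (unique symbols over total symbols observed), not PPMd's full escape mechanism; the function name and the convention of returning 1.0 for an empty stream are my own assumptions.

```python
# Sketch of the PPMd-style escape estimate: P(next symbol is new) is taken
# as unique symbols / total symbols observed (hypothetical helper name).
from collections import Counter


def escape_probability(observed):
    counts = Counter(observed)
    unique = len(counts)    # pseudocount: grows by one per new symbol
    total = len(observed)   # total symbols seen so far
    # Assumption: with nothing observed yet, any symbol is new.
    return unique / total if total else 1.0


print(escape_probability("aabbbc"))  # → 0.5 (3 unique / 6 total)
```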

Implementations of PPM compression vary greatly in other details. The actual symbol selection is usually coded arithmetically, although Huffman coding or some kind of dictionary coding is also possible. The underlying model of most PPM algorithms can also be extended to predict multiple symbols. It is also possible to use modeling other than Markov modeling, either replacing it entirely or supplementing it. The symbol size is usually static, typically a single byte, which makes generic handling of any file format easy.

Published research on this family of algorithms dates back to the mid-1980s. Software implementations were not popular until the early 1990s because PPM algorithms require a considerable amount of memory. Recent PPM implementations are among the best-performing lossless data compression methods for natural-language text.

Attempts to improve PPM algorithms led to the PAQ series of compression algorithms.

Literature

  • J. Cleary, I. Witten: Data Compression Using Adaptive Coding and Partial String Matching. In: IEEE Transactions on Communications. Vol. 32, No. 4, 1984, pp. 396–402, doi:10.1109/TCOM.1984.1096090.
  • A. Moffat: Implementing the PPM data compression scheme. In: IEEE Transactions on Communications. Vol. 38, No. 11, 1990, pp. 1917–1921, doi:10.1109/26.61469.
  • J. G. Cleary, W. J. Teahan, I. H. Witten: Unbounded length contexts for PPM. In: Proceedings DCC-95. IEEE Computer Society Press, 1995 (PDF). Also: J. G. Cleary, W. J. Teahan: Unbounded Length Contexts for PPM. In: The Computer Journal. Vol. 40, No. 2–3, 1997, pp. 67–75, doi:10.1093/comjnl/40.2_and_3.67.
  • C. Bloom: Solving the problems of context modeling (ZIP; 43 kB).
  • W. J. Teahan: Probability estimation for PPM. In: Proceedings NZCSRSC'95. 1995 (HTML [accessed February 28, 2011]).
  • Thomas Schürmann, Peter Grassberger: Entropy estimation of symbol sequences. In: Chaos: An Interdisciplinary Journal of Nonlinear Science. Vol. 6, No. 3, 1996, p. 414, doi:10.1063/1.166191, arXiv:cond-mat/0203436v1.