Canterbury Corpus

The Canterbury Corpus is a collection of files used to measure the speed and the compression ratio of lossless data compression methods. It was developed in 1997 at the University of Canterbury and is intended to replace the Calgary Corpus, which was developed in 1987.

Purpose

The Canterbury Corpus was developed as a basis for applying metrics to newly developed data compression methods and is primarily used to build test cases for algorithms during the development cycle. Although it can in principle also be used to compare different compression methods, the authors expressly distance themselves from this use and refer to similar collections and resources instead. In addition, the Canterbury Corpus is intended exclusively for testing lossless compression methods.
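
As an illustration of the kind of metric such a test case might record, the following minimal Python sketch compresses a byte string and reports its size, ratio and timing. It is not part of the corpus or its tooling: the function name compression_stats, the choice of zlib as the method under test, and the sample input are all assumptions made for this example. In practice the input would be the contents of a corpus file read from disk.

```python
import time
import zlib


def compression_stats(data: bytes, level: int = 6) -> dict:
    """Compress `data` with zlib and report simple size and timing metrics.

    Hypothetical helper for illustration; zlib stands in for whatever
    lossless method is actually being developed.
    """
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start

    # Round-trip check: a lossless method must reproduce the input exactly.
    assert zlib.decompress(compressed) == data

    size = len(data)
    return {
        "original_bytes": size,
        "compressed_bytes": len(compressed),
        # A ratio below 1.0 means the output is smaller than the input.
        "ratio": len(compressed) / size if size else float("nan"),
        # Average number of output bits spent per input byte.
        "bits_per_byte": 8 * len(compressed) / size if size else float("nan"),
        "seconds": elapsed,
    }


if __name__ == "__main__":
    # Stand-in input; replace with the bytes of a corpus file.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 200
    print(compression_stats(sample))
```

Running the same function after each change to an algorithm gives a simple regression check on both output size and correctness of the round trip.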

Packages

The Canterbury Corpus consists of several packages, some of which contain highly specialized data depending on the test purpose and algorithm. The Canterbury Corpus package itself offers eleven files in text and binary formats, including an excerpt from a work by William Shakespeare, and is primarily used to compare the algorithm under test with other existing compression methods. The Artificial, Large and Miscellaneous packages offer files with synthetically generated content, particularly large files (e.g. the complete content of the CIA World Fact Book) or purely numerical content. These packages are used to test a compression method in special situations.
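
To show why synthetic inputs are useful for such special situations, the sketch below generates three artificial byte strings and compresses each with zlib. The inputs are produced locally and are only loosely modelled on the idea of synthetically generated content; they are not the actual corpus files, and the names make_inputs and compressed_sizes are hypothetical.

```python
import random
import zlib


def make_inputs(size: int = 100_000, seed: int = 0) -> dict:
    """Build synthetic inputs that stress a compressor in different ways."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    alphabet = b"abcdefghijklmnopqrstuvwxyz"
    return {
        # Best case: a single byte value repeated.
        "repeated_byte": b"a" * size,
        # Short repeating pattern.
        "alphabet_cycle": (alphabet * (size // len(alphabet) + 1))[:size],
        # Worst case: pseudo-random bytes with no exploitable structure.
        "random_bytes": bytes(rng.getrandbits(8) for _ in range(size)),
    }


def compressed_sizes(inputs: dict) -> dict:
    """Return the zlib-compressed size in bytes for each named input."""
    return {name: len(zlib.compress(data)) for name, data in inputs.items()}


if __name__ == "__main__":
    for name, size in compressed_sizes(make_inputs()).items():
        print(f"{name}: {size} bytes")
```

The repetitive inputs should shrink to a small fraction of their original size, while the pseudo-random input should stay roughly as large as it was, which makes both extremes easy to check during development.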

References

  1. http://corpus.canterbury.ac.nz/purpose.html