British National Corpus

from Wikipedia, the free encyclopedia

The British National Corpus (BNC) is an English text corpus: a 100-million-word collection of written and spoken language. It draws on a large number of different sources in order to provide a representative cross-section of late 20th-century British English for academic purposes.

Features

Around ninety percent of the BNC consists of written language data, such as excerpts from regional and national newspapers, trade journals, magazines from many different areas of interest, academic books, fiction (novels etc.), official and private letters, school and university essays, and many other text types.

The remaining ten percent is spoken language data, consisting for the most part of spontaneous speech from everyday life, recorded by volunteers of different ages, regional origins and social classes in order to achieve a demographic balance. The recorded conversations arose in a variety of situations, ranging from formal business and government meetings to radio broadcasts and telephone calls.

Work on the BNC began in 1991 and lasted until 1994. After the project was completed, no new texts were added, but the text corpus was slightly revised before the publication of the second edition under the name "BNC World". Two sub-corpora with excerpts from the BNC have been published: the BNC Sampler (a collection of one million words each of written and spoken language) and BNC Baby (four million words from four different types of text).
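The proportions above can be turned into approximate word counts. The following is a minimal sketch that uses only the rounded figures given in the text; the variable names and the helper share_of_corpus are illustrative and not part of any BNC tooling.

```python
# Rounded figures taken from the text; the real counts are approximate.
TOTAL_WORDS = 100_000_000   # size of the full corpus
WRITTEN_PERCENT = 90        # "around ninety percent" is written language

written_words = TOTAL_WORDS * WRITTEN_PERCENT // 100
spoken_words = TOTAL_WORDS - written_words

# Sub-corpus sizes as described in the text.
sub_corpora = {
    "BNC Sampler": 1_000_000 + 1_000_000,  # one million written + one million spoken
    "BNC Baby": 4_000_000,                 # four million words from four text types
}


def share_of_corpus(words, total=TOTAL_WORDS):
    """Return the size of a sub-corpus as a percentage of the full corpus."""
    return 100 * words / total


print(f"written: about {written_words:,} words")
print(f"spoken:  about {spoken_words:,} words")
for name, size in sub_corpora.items():
    print(f"{name}: {size:,} words ({share_of_corpus(size):.0f}% of the BNC)")
```

On these figures, both sub-corpora are small slices of the full corpus, which is why they are practical for teaching and quick experiments.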

The BNC has four main properties with regard to the design criteria for text corpora:

  • It is monolingual. The BNC covers modern British English and contains no data from other languages used in the British Isles. Nonetheless, words of non-British origin do appear in the BNC.
  • It is synchronic. The BNC covers only British English of the late 20th century. It offers no insight into the historical developments that produced this variety and does not support historical comparisons.
  • It is general. The BNC includes many different styles and varieties and is not limited to a specific subject area, genre or register.
  • It contains text excerpts (samples). For written sources, samples of 45,000 words were taken from different parts of single-author texts. Shorter texts of up to 45,000 words and texts by multiple authors (such as magazines and newspapers) were, however, included in full. Using excerpts allows a wider range of text types to be represented within the 100-million-word limit and avoids overrepresenting idiosyncratic texts.
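The sampling rule in the last point can be illustrated with a short sketch. This is a hypothetical illustration, not the procedure the BNC compilers actually used: the function name sample_text, the parameter n_parts and the random choice of excerpt positions are all assumptions. Only the 45,000-word cap and the rule that short or multi-author texts are included in full come from the text.

```python
import random

MAX_SAMPLE_WORDS = 45_000  # per-text cap stated in the article


def sample_text(words, single_author, max_words=MAX_SAMPLE_WORDS,
                n_parts=3, rng=None):
    """Return the words of one text that would enter the corpus.

    Assumption: a long single-author text is sampled as n_parts equal,
    non-overlapping excerpts, one from each successive region of the text.
    """
    # Short texts and multi-author texts are included in full.
    if len(words) <= max_words or not single_author:
        return list(words)

    rng = rng or random.Random(0)      # fixed seed keeps the sketch repeatable
    part_len = max_words // n_parts    # length of each excerpt
    region_len = len(words) // n_parts # length of each region of the text

    sample = []
    for i in range(n_parts):
        region_start = i * region_len
        # Choose a start so the excerpt stays inside its own region.
        start = rng.randint(region_start,
                            region_start + region_len - part_len)
        sample.extend(words[start:start + part_len])
    return sample


# Usage: a 100,000-word single-author text is cut down to 45,000 words.
novel = [f"w{i}" for i in range(100_000)]
print(len(sample_text(novel, single_author=True)))
```

Taking one excerpt per region is one simple way to draw from "different parts" of a text; other schemes (for example beginning, middle and end at fixed positions) would satisfy the description equally well.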

External links