Text corpus


A text corpus (plural: text corpora; also simply corpus; from Latin corpus 'body') is a collection of written texts, or of spoken utterances recorded in writing, in a particular language or text genre.


Text corpora are studied in various academic disciplines, mainly in linguistics and literary studies, but also in historically and socially oriented subjects such as ethnology and cultural anthropology. Corpora are a means by which, for example, a particular language or language variety can be described, or the works of a particular author or group of authors can be recorded and researched. They also serve as sources for investigating other questions, for example in sociolinguistics. In jurisprudence and legal history, text corpora are important sources of law: a body of law such as the Corpus Iuris Civilis is a collection of normative texts that has grown over time.

For linguistic purposes, texts of particular types and in particular quantities from living languages are compiled into text corpora according to scientific criteria. With the advent of machine processing through digitization, such collections have become very important in many linguistic disciplines. The new auxiliary discipline of corpus linguistics developed from this.

Today a text corpus is typically available in digital form. For the purposes of language description, large corpora of many millions and sometimes several billion words have been created for numerous national languages; they are intended to reflect a particular proportion of the individual text types in the respective language. In addition, there are numerous special corpora, such as child language corpora, dialect corpora, corpora consisting of complete editions of literary works, and many more. Text corpora designed specifically for individual linguistic studies are also increasingly being created.

Types of text corpora

Text corpora can be categorized in different ways according to formal and content-related criteria. The most important distinctions are:

Paper corpora and electronic corpora
Text corpora compiled on paper were laborious to create and accordingly rare. In the past they played an important role in dictionary writing, for example, because the meanings of individual words were identified or documented on the basis of these collections.
Special software such as WordSmith is required to use today's machine-readable corpora. However, a number of corpora are accessible online and can be used on one's own computer without such software.
Partial corpora and reference corpora
Partial corpora are those that only offer an excerpt from the entire spectrum of a language, such as text corpora that only contain texts from everyday colloquial language or only texts from daily newspapers.
A reference corpus is a text corpus that is intended, according to linguistic criteria, to cover a single language (e.g. German or English) representatively and in its entirety, so that general, valid statements about the system of that language can be made on its basis.
Static corpora and monitor corpora
Static corpora are closed and will no longer be expanded. Examples are text corpora containing the works of a deceased writer, a corpus consisting of all written sources in an extinct language, or a corpus made up of transcripts of recordings of a small child acquiring language. Ancient languages that are attested only in a few documents, or even only in fragments, are also referred to as “corpus languages” because they can only be reconstructed and described using this limited text corpus, which can no longer be expanded.
Monitor corpora, on the other hand, are text corpora that are designed to be expanded (such as text collections consisting of the articles of a current daily newspaper). They are called monitor corpora because they are subject to constant systematic observation and recording, that is, to monitoring.
Raw corpora and annotated corpora
Raw corpora are text corpora that consist purely of the language data used for the investigation. Annotated corpora are text corpora which, in addition to these primary data, also contain additional information, so-called metadata. These annotations can be of very different kinds. Common examples are corpora in which the part of speech is specified for each individual word, corpora that contain glosses (the language of the glosses need not be that of the corpus), and corpora that provide information on the syntax of the individual sentences. The latter are also referred to as “treebanks”, by analogy with the term “database”, because syntactic tree structures are annotated in them. Text corpora consisting of spoken language data are often enriched with phonological data. The metadata of a text corpus also include information about when the text was created, its authorship, the compilation of the corpus, and more.
Annotated corpora offer fundamentally improved research opportunities, especially for questions in theoretical linguistics or computational linguistics. However, annotating extensive text corpora is relatively laborious and consequently costly, so that the large reference corpora in particular are only partially annotated.
Monolingual and multilingual corpora
Monolingual corpora allow statements to be made about the respective individual language. Multilingual corpora contain texts from mostly two, sometimes several, languages. Either the texts in the second language are translations of the texts in the first language, in which case one speaks of “parallel corpora”, or the corpus of the second language consists of the same text types in the same proportions as the corpus of the first language (e.g. newspaper articles on the same subjects).
Multilingual corpora play a role primarily in machine translation and in language teaching research. Automatic or statistical analysis of, for example, the frequency and distribution of certain words within the individual languages serves the automatic creation of bilingual dictionaries.
Although it is not a text corpus in the strict sense, the Bible often takes over some functions of a multilingual text corpus, because it is also available in smaller, less widely spoken languages. It is therefore not only useful for linguistic comparison but also of great importance in biblical studies, for example in research into translation practices and the understanding of biblical terms.
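
The difference between a raw and an annotated corpus drawn in the list above can be illustrated with a minimal Python sketch. The example sentence, the tag names (DET, NOUN, VERB, ADP) and the helper function words_with_tag are illustrative assumptions; they do not reproduce the tagset or file format of any particular corpus.

```python
from typing import List, Tuple

# Raw corpus: only the primary language data.
raw_corpus: List[str] = ["The cat sat on the mat"]

# Annotated corpus: every token additionally carries a part-of-speech tag.
# The tag names are hypothetical and chosen only for illustration.
annotated_corpus: List[List[Tuple[str, str]]] = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
]

def words_with_tag(corpus: List[List[Tuple[str, str]]], tag: str) -> List[str]:
    """Return all word forms in the corpus that are annotated with `tag`."""
    return [word for sentence in corpus for word, t in sentence if t == tag]

nouns = words_with_tag(annotated_corpus, "NOUN")
print(nouns)
```

A query such as "all nouns" can be answered directly from the annotated version, whereas the raw corpus would first have to be analysed before the same question could be asked.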

Text corpora in linguistics

Text corpora offer the possibility of examining the system of a language and its use on the basis of actually uttered language data in various ways. The term “corpus” in the sense of a compilation of linguistic data in order to make general statements based on these samples has been used in various disciplines of linguistics for decades.

This empirical orientation contrasts with the rationalist orientation of generative grammar, which is currently a dominant paradigm in theoretical linguistics. Proponents of this approach accordingly view the use and benefits of text corpora critically, especially with regard to questions of grammar. However, in this area too, corpora are increasingly being used to test hypotheses.

Linguistic subfields in which text corpora are increasingly used are corpus linguistics and computational linguistics. Here, corpora that are as large as possible are evaluated in order to make general statements about a language. Examples of the use of corpora in corpus linguistics include determining word meanings on the basis of concordances (i.e. attested occurrences in specific texts), determining collocations (i.e. the co-occurrence of a word with certain other words), and answering questions about the syntax of a language. In computational and mathematical linguistics, word frequencies and word distributions in texts, collocations, and sentence and word lengths, among other things, are of interest. In discourse analysis, text corpora of varying size, primarily from public discourse (politics, media), are used to draw conclusions about the latent attitudes of a social group towards certain things and facts, or to determine how the group understands certain terms.
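
The three basic evaluations mentioned here (concordances, collocations and word frequencies) can be sketched in a few lines of Python. The toy text, the window sizes and the function names concordance and collocates are assumptions made for illustration; real corpus tools work on millions of words and usually apply statistical association measures rather than the raw neighbour counts used in this sketch.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical toy corpus, already tokenized by whitespace.
TEXT = (
    "the strong tea was hot . she drank strong tea every day . "
    "he prefers strong coffee . the tea was cold ."
)
TOKENS: List[str] = TEXT.split()

def concordance(tokens: List[str], keyword: str, width: int = 2) -> List[Tuple[str, str, str]]:
    """Keyword-in-context lines: `width` tokens to the left and right of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def collocates(tokens: List[str], keyword: str, window: int = 1) -> Counter:
    """Count the words that occur within `window` tokens of the keyword."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w != ".")  # skip punctuation
    return counts

# Simple word frequency list, ignoring the sentence-final full stops.
word_frequencies = Counter(t for t in TOKENS if t != ".")

for left, kw, right in concordance(TOKENS, "tea"):
    print(f"{left:>15} [{kw}] {right}")
print(collocates(TOKENS, "tea").most_common(2))
```

Reading the concordance lines side by side shows the contexts in which a word is used, while the neighbour counts hint at which words it tends to co-occur with.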

Although the World Wide Web is also a collection of language in actual use, it should not be regarded as a text corpus in the strict sense. Nevertheless, it is used for certain questions, with appropriate caution and under certain restrictions. For example, regional websites were used alongside various printed texts in compiling the Variantenwörterbuch des Deutschen (German dictionary of variants).

Reference corpora of individual languages

Extensive text corpora are created to describe national languages or linguistic varieties, and today many of them can also be used online. In such cases the necessary analysis software is already provided on the web and can be used without installing a program on one's own computer.

The first text corpus of a national language variety was the Brown Corpus, created in the 1960s and fully annotated with 80 parts of speech, which was intended to represent contemporary American English. (The name derives from Brown University in Providence, Rhode Island, where the corpus was created.) It comprises 1 million words and consists of 500 text excerpts of 2,000 words each, drawn from 15 different text types (different kinds of newspaper and literary texts, religious texts, specialist literature, etc.). The view that a text sample of 2,000 words is representative of a text type within a corpus is still held today. The Brown Corpus served as the basis for the American Heritage Dictionary, the first dictionary created exclusively on the basis of such a corpus. The Brown Corpus was followed in the 1980s by, among others, the fully annotated Lancaster-Oslo-Bergen Corpus (LOB Corpus for short), which is modelled on the Brown Corpus and consists of texts in British English.

Today, the British National Corpus, the American National Corpus and the International Corpus of English (with texts from different English-speaking countries) are important for English.

The German Reference Corpus (DeReKo), compiled at the Institute for the German Language in Mannheim, is currently the most extensive corpus of German. It consists of over 43 billion words (as of March 2019) of written language and is in principle open to everyone.

As part of the research project “Digital Dictionary of the German Language of the 20th Century”, the largest balanced text corpus of 20th-century German was made available. There are also other corpora, such as the complete online archives of “Die Zeit” (from 1996), the “Tagesspiegel” (from 1996) and the “Potsdamer Neueste Nachrichten”, as well as a large corpus of Jewish periodicals (Germania Judaica). The corpora are linked to a large monolingual German dictionary, the Wörterbuch der deutschen Gegenwartssprache (Dictionary of Contemporary German). When a keyword is queried, the system returns not only concordances but also information on synonyms, hyponyms, hypernyms and collocations.

The Department of Automatic Language Processing at the University of Leipzig also works on and with large corpora and, among other things, maintains a corpus of around 1.5 billion words (around 100 million sentences). The statistical data of a reduced corpus can also be queried online in the vocabulary portal of the University of Leipzig.

A Swiss text corpus has also been available online since 2010.

Large corpora now also exist for many other national languages. This applies not only to the Indo-European language area but also to other languages with large numbers of speakers, especially in Asia. Smaller languages of Asia and Africa are also documented in the form of text archives or less extensive annotated text corpora.

Special text corpora

In addition to the large reference corpora, there is an ever-increasing number of text collections that go not only by the name “corpus” but also by “(text) archive” or “database”. These include, for example, dialect corpora or corpora of spoken language, such as those in the Bavarian Archive for Speech Signals and the Archive for Spoken German. Another type of special corpus is the complete text edition, such as the Austrian Academy Corpus prepared at the Austrian Academy of Sciences, which comprises the complete editions of the essayistic journals “Die Fackel” and “Der Brenner”.

For psycholinguistics and clinical linguistics in particular, the CHILDES database, which contains extensive transcripts of spoken child language, is important for research on typical and impaired language acquisition in children.

As part of large-scale projects to digitize old book collections, more and more lexicons, dictionaries, encyclopedias and literary works are being captured and made available online. These include projects such as the German Text Archive (Deutsches Textarchiv), which aims to provide a comprehensive selection of historical texts from several centuries. In the best case, such text collections offer free online full-text search across the entire holdings. However, these texts often cannot be used for linguistic purposes as conveniently as specially designed corpora, because the search software is not designed for that.

Another special corpus is the Google Books corpus, whose raw data anyone can evaluate online with the Google Books Ngram Viewer, which plots the frequency of character strings or word sequences as diagrams.
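
The kind of evaluation behind such frequency diagrams can be sketched as follows. This is a minimal illustration under stated assumptions: the miniature data set BOOKS_BY_YEAR and the functions ngrams and relative_frequency are hypothetical and do not reflect the actual data format or interface of the Google Books Ngram Viewer.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical miniature "book corpus": publication year -> tokenized text.
BOOKS_BY_YEAR: Dict[int, List[str]] = {
    1900: "the telegraph office was closed".split(),
    1950: "the telephone rang and the telephone rang again".split(),
    2000: "the internet and the telephone".split(),
}

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(term: str, books: Dict[int, List[str]] = BOOKS_BY_YEAR) -> Dict[int, float]:
    """For each year, the share of n-grams equal to `term` (n = words in `term`)."""
    target = tuple(term.split())
    n = len(target)
    result: Dict[int, float] = {}
    for year, tokens in sorted(books.items()):
        grams = ngrams(tokens, n)
        result[year] = Counter(grams)[target] / len(grams) if grams else 0.0
    return result

print(relative_frequency("telephone"))
print(relative_frequency("the telephone"))
```

Dividing by the number of n-grams per year makes the values comparable between years with different amounts of text; plotting them against the year yields a frequency curve of the kind the text describes.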


Literature

  • Deutsches Institut für Normung e. V. (Ed.): Structure and use of terminology databases and text corpora. German translation of the international technical report ISO/TR 12618, prepared in the NA Terminologie committee. 1st edition. Berlin / Vienna / Zurich 1997.
  • Paul Baker: Using Corpora in Discourse Analysis. Continuum, London / New York 2009, ISBN 978-0-8264-7724-8.
  • Reinhard Fiehler, Peter Wagener: The Spoken German Database (DGD): Collection, documentation, archiving and investigation of spoken language as a task of linguistics. In: Conversation Research - Online Journal for Verbal Interaction. 6 (2005), pp. 136-147 (www.gespraechsforschung-ozs.de).
  • Hagen Hirschmann: Corpus Linguistics. An Introduction. Metzler, Stuttgart 2019, ISBN 978-3-476-05493-7.
  • Werner Kallmeyer, Gisela Zifonun (Eds.): Language corpora: amount of data and progress in knowledge. de Gruyter, Berlin / New York 2007 (= IDS Yearbook 2006).
  • Lothar Lemnitzer, Heike Zinsmeister: Corpus Linguistics. An Introduction. Gunter Narr Verlag, Tübingen 2006 (= Narr study books).
  • Wilfried Lenders, Gerd Willée: Linguistic Data Processing: A Textbook. Westdeutscher Verlag, Opladen / Wiesbaden 1998.
  • Anton Näf, Rolf Duffner (Eds.): Corpus Linguistics in the Age of Text Databases (= Linguistik online. Volume 28, no. 3). July 1, 2006 (bop.unibe.ch [accessed April 13, 2020]).
  • Rainer Perkuhn, Holger Keibel, Marc Kupietz: Corpus Linguistics. Fink, Paderborn 2012, ISBN 978-3-8252-3433-1.
  • Carmen Scherer: Corpus Linguistics. Winter, Heidelberg 2006, ISBN 3-8253-5164-5.
  • Thomas Schmidt: Data Archives for Conversation Research: Perspectives, Problems and Approaches to Solutions. In: Conversation Research - Online Journal for Verbal Interaction. 6 (2005), pp. 103-126 (www.gespraechsforschung-ozs.de).
  • P. Wagener, K.-H. Bausch (Eds.): Sound recordings of spoken German. Documentation of the holdings of linguistic research projects and archives. Niemeyer, Tübingen 1997 (= Phonai, Volume 40).

References

  1. An overview is provided, for example, in the introduction to corpus linguistics by Scherer (2006).
  2. For example, in a phonetic study: “… our corpus consisted of monosyllabic words spoken in isolation by two males and one female.” (M. Halle, G. W. Hughes, J.-P. A. Radley: Acoustic Properties of Stop Consonants. In: Journal of the Acoustical Society of America, Vol. 29 (1957); reprinted in: Ilse Lehiste (Ed.): Readings in Acoustic Phonetics, second printing, MIT Press, Cambridge (Mass.) 1969, ISBN 0-262-12025-9, p. 171.)
  3. For example, John Sinclair analyzes the meaning of the English word “(to) yield” and categorizes noun constructions with “of” such as “bottle of wine”. (John Sinclair: Corpus, Concordance, Collocation. 4th impression. Oxford University Press, Oxford 1997, ISBN 0-19-437144-1.)
  4. For example, Noah Bubenhofer examines how names for ethnic groups or the term “terrorism” are actually used in the “Neue Zürcher Zeitung”. (Noah Bubenhofer: Sprachgebrauchsmuster. Corpus linguistics as a method of discourse and cultural analysis. De Gruyter, Berlin 2009, ISBN 978-3-11-021584-7.)
  5. Ruth Esterhammer: The German dictionary of variants: From the idea to the finished product. In: Rudolf Muhr, Manfred B. Sellner (Eds.): Ten Years of Research on Austrian German: 1995–2005. A balance sheet. Peter Lang, Frankfurt am Main 2006, ISBN 3-631-55450-8, pp. 65-78.
  6. The German reference corpus - DeReKo. Expansion and maintenance of the corpora of written contemporary language. In: Digital Linguistics. Institute for the German Language, March 2019, accessed on May 3, 2019.