German Reference Corpus

from Wikipedia, the free encyclopedia

The German Reference Corpus (DeReKo for short) is an electronic archive of German-language corpora of written text that has existed since 1964 and is maintained and continuously expanded by the Institute for the German Language (IDS) in Mannheim. With over 43 billion words (as of May 2019), DeReKo is the world's largest collection of electronic corpora of contemporary German intended for research purposes. Registered users can access DeReKo free of charge through the web application COSMAS II.

Alternative names

The German Reference Corpus is often referred to by other names, among them Mannheimer Korpora, IDS-Korpora, COSMAS-Korpora, and archive of the corpora of written contemporary language at the IDS. The name German Reference Corpus (DeReKo) originally applied only to one part of today's archive, which was built between 1999 and 2002 in a project of the same name involving several institutions. Since 2004 it has been the official name of the entire corpus archive.

Conception and composition

The German Reference Corpus contains fiction, scientific and popular science texts, a large number of newspaper texts, and various other text types. The texts cover the period from the middle of the 20th century to the present day.

In contrast to some other well-known corpora and corpus archives (such as the DWDS core corpus or the British National Corpus), the German Reference Corpus is expressly not designed as a balanced corpus: the texts are neither allocated to the individual text types according to specified percentages nor evenly distributed over the period covered.

This design follows from the fact that whether a corpus is a balanced or even representative sample can, in principle, only be judged with reference to a fixed segment of language (i.e. a fixed population). Different linguistic questions, however, can relate to very different segments of language. The German Reference Corpus is therefore designed as a kind of original sample of written German usage, from which a balanced sample can be put together to suit the question at hand and its associated population. A corpus compiled in this way from the texts of an existing corpus archive is also called a virtual corpus.
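
The idea of drawing a balanced virtual corpus from an unbalanced archive can be illustrated with a minimal sketch. It is not the procedure used by COSMAS II or the IDS; the function name build_virtual_corpus, the record fields "id" and "type", the toy archive and the target shares are all assumptions made for illustration, and balance is measured here in number of texts rather than in words.

```python
import random
from collections import defaultdict


def build_virtual_corpus(archive, target_shares, size, seed=0):
    """Draw `size` texts so that each text type gets its target share.

    Hypothetical helper: `archive` is a list of dicts with a "type" key,
    `target_shares` maps text type -> desired fraction of the sample.
    """
    rng = random.Random(seed)

    # Group the (unbalanced) archive by text type.
    by_type = defaultdict(list)
    for text in archive:
        by_type[text["type"]].append(text)

    sample = []
    for text_type, share in target_shares.items():
        quota = round(size * share)
        pool = by_type[text_type]
        if quota > len(pool):
            raise ValueError(f"archive holds too few texts of type {text_type!r}")
        # Sample without replacement within each text type.
        sample.extend(rng.sample(pool, quota))
    return sample


# Toy archive dominated by newspaper texts (invented data).
archive = (
    [{"id": f"news-{i}", "type": "newspaper"} for i in range(80)]
    + [{"id": f"fic-{i}", "type": "fiction"} for i in range(15)]
    + [{"id": f"sci-{i}", "type": "science"} for i in range(5)]
)

# Shares chosen for one hypothetical research question.
virtual = build_virtual_corpus(
    archive,
    {"newspaper": 0.5, "fiction": 0.3, "science": 0.2},
    size=20,
)
```

A different research question would simply pass different shares, or filter the archive by period first, and obtain a different virtual corpus from the same archive.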

Access

Due to copyright and licensing restrictions, the DeReKo archive may not be copied and, in particular, may not be offered for download. It can be searched and analyzed via the COSMAS II interface, for which users must register by name and commit to purely scientific, non-commercial use. Among other things, COSMAS II lets users compile and work with a virtual corpus, drawn from the German Reference Corpus, that fits their research question.

Around 37,000 users from 110 countries are currently registered for COSMAS II and can carry out scientific research and analyses in DeReKo.

References

  1. The German Reference Corpus - DeReKo. Expansion and maintenance of the corpora of written contemporary language. In: Digital Linguistics. Institute for the German Language, March 2019, accessed May 3, 2019.
  2. COSMAS II - Registration. Institute for the German Language, accessed November 16, 2018.
  3. COSMAS II - Overview of the portal. Institute for the German Language, accessed November 16, 2018.