German Reference Corpus
The German Reference Corpus (DeReKo for short) is an electronic archive of German-language corpora of written text that has existed since 1964 and is maintained and continuously expanded by the Institute for the German Language (IDS) in Mannheim. With over 43 billion words as of May 2019, DeReKo is the world's largest collection of electronic corpora of contemporary German intended for scientific purposes. DeReKo is publicly accessible to registered users via the free web application COSMAS II.
Alternative names
The German Reference Corpus is often referred to by other names, including Mannheimer Korpora, IDS-Korpora, COSMAS-Korpora, and archive of the corpora of written contemporary language at the IDS. The designation German Reference Corpus (DeReKo) originally applied only to one part of today's archive, which was built between 1999 and 2002 in a project of the same name involving several institutions. Since 2004 it has been the official name of the entire corpus archive.
Conception and composition
The German Reference Corpus contains fiction, scientific and popular-science texts, a large number of newspaper texts, and various other text types. The texts cover the period from the middle of the 20th century to the present day.
In contrast to some other well-known corpora and corpus archives (such as the DWDS core corpus or the British National Corpus), the German Reference Corpus is expressly not designed as a balanced corpus: the texts are neither distributed across the individual text types according to predefined percentages nor evenly distributed over the period covered.
This design follows from the fact that whether a corpus is a balanced or even representative sample can, in principle, only be assessed with reference to a fixed language domain (i.e. a fixed population). Different linguistic questions, however, can relate to very different language domains. The German Reference Corpus is therefore designed as a kind of primordial sample of written German usage, from which a balanced sample can be drawn to suit the question at hand and its associated population. A corpus compiled in this way from the texts of an existing corpus archive is also called a virtual corpus.
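The idea of drawing a balanced virtual corpus from an unbalanced archive can be illustrated with a small sketch. It is not how COSMAS II works internally and does not use any DeReKo interface. The toy archive, the metadata fields text_type and words, and the function names build_virtual_corpus and words_by_type are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Hypothetical toy archive, deliberately skewed toward newspaper texts.
# Every text has 100 words so the outcome does not depend on the shuffle.
archive = (
    [{"id": f"news-{i}", "text_type": "newspaper", "words": 100} for i in range(10)]
    + [{"id": f"fic-{i}", "text_type": "fiction", "words": 100} for i in range(3)]
    + [{"id": f"sci-{i}", "text_type": "science", "words": 100} for i in range(3)]
)


def build_virtual_corpus(archive, target_shares, total_words, seed=0):
    """Select texts so that each text type stays within its word budget.

    target_shares maps a text type to its desired share of total_words;
    the shares describe the population the research question refers to.
    """
    rng = random.Random(seed)

    # Group the archive by text type (the strata of the sample).
    by_type = defaultdict(list)
    for text in archive:
        by_type[text["text_type"]].append(text)

    selected = []
    for text_type, share in target_shares.items():
        budget = share * total_words
        candidates = list(by_type.get(text_type, []))
        rng.shuffle(candidates)  # random draw within the stratum
        used = 0
        for text in candidates:
            if used + text["words"] > budget:
                continue  # this text would exceed the budget for its type
            selected.append(text)
            used += text["words"]
    return selected


def words_by_type(corpus):
    """Sum the word counts of a corpus per text type."""
    totals = defaultdict(int)
    for text in corpus:
        totals[text["text_type"]] += text["words"]
    return dict(totals)


shares = {"newspaper": 0.5, "fiction": 0.25, "science": 0.25}
virtual_corpus = build_virtual_corpus(archive, shares, total_words=800)
print(words_by_type(virtual_corpus))
```

With these toy numbers the selection contains 400 words of newspaper text and 200 words each of fiction and science, although the archive itself is dominated by newspaper texts. A different research question would simply use different target shares against the same archive.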
Access
Due to copyright and licensing restrictions, the DeReKo archive may not be copied and, in particular, may not be offered for download. It can be searched and analyzed via the COSMAS II interface; users must register by name and commit to purely scientific, non-commercial use. Among other things, COSMAS II lets users compile and use a virtual corpus from the German Reference Corpus that fits their research question.
Around 37,000 users from 110 countries are currently registered for COSMAS II and can carry out scientific research and analyses in DeReKo.
Literature
- Kupietz, Marc / Belica, Cyril / Keibel, Holger / Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA), pp. 1848-1854.
- Kupietz, Marc / Keibel, Holger (2009): The Mannheim German Reference Corpus (DeReKo) as a basis for empirical linguistic research. In: Working Papers in Corpus-based Linguistics and Language Education, No. 3. Tokyo: Tokyo University of Foreign Studies (TUFS), pp. 53-59.
Web links
- Expansion and maintenance of the corpora of written contemporary language - The German Reference Corpus - DeReKo, description at the Institute for the German Language
- COSMAS II - research and analysis system for the German reference corpus and other written corpora
References
- The German Reference Corpus - DeReKo. Expansion and maintenance of the corpora of written contemporary language. In: Digital Linguistics. Institute for the German Language, March 2019, accessed May 3, 2019.
- COSMAS II - Registration, Institute for the German Language, accessed November 16, 2018.
- COSMAS II - Overview of the portal, Institute for the German Language, accessed November 16, 2018.