German Text Archive

from Wikipedia, the free encyclopedia

The German Text Archive (Deutsches Textarchiv, DTA) is a scholarly digital text archive that has been based at the Berlin-Brandenburg Academy of Sciences and Humanities since July 2007 and is funded by the German Research Foundation. Its task is to digitize a cross-disciplinary selection of German-language texts from around 1600 to 1900 on the basis of first editions and to make them available on the Internet as a linguistically annotated full-text corpus.

Structure and organization of the archive

The declared aim of the German Text Archive is to provide users with a representative, interdisciplinary selection of digitized German-language texts. In addition to canonical literary works, the archive's concept places particular emphasis on lesser-known and, above all, non-literary texts. To ensure a representative selection, the German Text Archive uses its own selection list compiled from bibliographies. The archive will also contain a large part of the source corpus of the Deutsches Wörterbuch ("Grimm's dictionary"), which is likewise based at the academy. In a final step, the members of the Berlin-Brandenburg Academy of Sciences and Humanities, who come from a range of disciplines, were asked to evaluate the resulting list and to suggest missing works from the perspective of their fields.

Under the direction of the Germanist and psycholinguist Wolfgang Klein, an interdisciplinary team of book and information scientists, Germanists, computational linguists and computer scientists, together with a number of student assistants, works at the German Text Archive to build and maintain the holdings.

Technical realization

When digitizing the holdings, the DTA works with numerous scholarly institutions and libraries, which make copies from their collections available for digitization. Around 600,000 digital images with a total data volume of almost ten terabytes have been produced since work on the holdings began. These digital images form the basis for creating the full texts. Depending on the quality and complexity of the source documents, the texts are either captured with text recognition software (OCR) developed in-house and then corrected, or captured by an external partner using the double-keying process. In a final step, the texts are linguistically annotated using computational linguistic tools.
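Double keying generally means that two people transcribe the same page independently and that the two versions are then compared, so that every disagreement can be checked against the image. The following is a minimal sketch of that comparison step only. It uses Python's standard difflib module; the function name and the sample strings are hypothetical, and nothing here is taken from the DTA's or its partner's actual workflow.

```python
import difflib


def double_key_disagreements(key_a: str, key_b: str):
    """Return the spans where two independent transcriptions differ.

    Each item is (operation, text_in_a, text_in_b); identical
    transcriptions yield an empty list.
    """
    matcher = difflib.SequenceMatcher(a=key_a, b=key_b, autojunk=False)
    return [
        (tag, key_a[i1:i2], key_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample: the second typist misread one letter.
first_keying = "Die Weisheit"
second_keying = "Die Weifheit"

for operation, in_a, in_b in double_key_disagreements(first_keying, second_keying):
    print(f"{operation}: {in_a!r} vs. {in_b!r}")
```

Only the flagged spans need a manual check, which is why the approach suits source documents that are too difficult for automatic text recognition.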

The texts are available for download in HTML and in TEI P5 format. Although these are pure transcriptions of public-domain texts that can be used freely, the full texts are licensed under the CC-BY-NC license, which excludes commercial use and thereby suggests that copyrights exist (see Copyfraud).
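As an illustration of what working with a TEI P5 download can look like, the sketch below reads the title and the body paragraphs from a TEI document using only Python's standard library. The sample document is invented for this example and is far simpler than a real DTA file; the function name is hypothetical. The one fixed element is the TEI namespace, http://www.tei-c.org/ns/1.0.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; element names must be qualified with it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature document, not an actual DTA text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispieltext</title></titleStmt>
      <publicationStmt><p>Hypothetical sample</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Erster Absatz.</p>
      <p>Zweiter <hi rend="italic">Absatz</hi>.</p>
    </body>
  </text>
</TEI>"""


def read_tei(xml_text: str):
    """Return (title, paragraphs) from a TEI P5 document string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    # itertext() flattens inline markup such as <hi> into plain text.
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body/tei:p", TEI_NS)
    ]
    return title, paragraphs


title, paragraphs = read_tei(SAMPLE)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

Real files from the archive contain much richer markup, so the paths used here would need to be adapted to the actual document structure.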


For a later project phase, the German Text Archive plans to develop the holdings into an active archive. Users should be able to make their own text selections, set persistent bookmarks on text passages and add annotations. If the necessary staff and technical resources become available, the aim is to allow registered users to add texts to the DTA themselves in accordance with the archive's guidelines.

In addition to linguistics and literary studies, digitization in the German Text Archive also opens up research perspectives for book studies and communication studies, such as research on the history of typography and publishing.
