Treebank (linguistics)
A treebank, also called a parsed corpus, is a text corpus in which every sentence has been parsed, i.e. annotated with a syntactic structure. The term treebank reflects the fact that the syntactic structure is usually represented as a tree.
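To illustrate how a parsed sentence can be stored as a tree, the following is a minimal sketch in Python. It assumes a bracketed notation of the form `(LABEL child child ...)`; the notation, the labels (`S`, `NP`, `VP`, etc.), the example sentence, and the function names `parse_bracketed` and `leaves` are illustrative assumptions, not taken from any particular treebank.

```python
# Minimal sketch: read a bracketed parse into a nested (label, children) tree.
# The bracketed notation, labels, and example sentence are assumptions
# chosen for illustration.

def parse_bracketed(text):
    """Parse a string like '(S (NP (DT the) (NN dog)) (VP (VBZ barks)))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1
        label = tokens[pos]          # category label of this node
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in their original order."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_bracketed("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
print(tree[0])        # label of the root node
print(leaves(tree))   # the words of the sentence
```

The nested tuples keep the annotation and the text together: the labels form the inner nodes of the tree, and reading off the leaves recovers the original sentence.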
Treebanks are often built on corpora that have already been annotated with part-of-speech tags. In addition, treebanks are sometimes extended with semantic or other linguistic information.
Treebanks can be created manually, with linguists annotating each sentence with a syntactic structure, or semi-automatically, with a parser assigning a syntactic structure that a linguist then checks and, if necessary, corrects. In practice, fully parsing and checking natural-language text is a labor-intensive process.
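The semi-automatic workflow can be sketched as a simple loop in which a parser proposes a structure and a reviewer accepts or corrects it. Everything in the sketch is hypothetical: `annotate_semi_automatically`, `toy_parser`, and `toy_reviewer` are stand-ins invented for illustration, and a real project would use an actual parser and a human annotator in their place.

```python
# Sketch of semi-automatic annotation: a parser proposes, a reviewer corrects.
# All names here are hypothetical stand-ins, not a real tool's API.

def annotate_semi_automatically(sentences, propose_parse, review):
    """Return the annotated corpus and the number of corrected proposals."""
    treebank = []
    corrected = 0
    for sentence in sentences:
        proposal = propose_parse(sentence)   # automatic step
        final = review(sentence, proposal)   # manual check (simulated here)
        if final != proposal:
            corrected += 1
        treebank.append((sentence, final))
    return treebank, corrected


def toy_parser(sentence):
    # Deliberately naive: puts every word directly under S.
    return ("S", sentence.split())


def toy_reviewer(sentence, proposal):
    # Simulates a linguist who knows the right structure for one sentence
    # and accepts the parser's proposal for everything else.
    fixes = {
        "the dog barks": ("S", [("NP", ["the", "dog"]), ("VP", ["barks"])]),
    }
    return fixes.get(sentence, proposal)


treebank, corrected = annotate_semi_automatically(
    ["the dog barks", "birds sing"], toy_parser, toy_reviewer
)
print(corrected)      # how many proposals the reviewer changed
print(treebank[0][1]) # the corrected structure for the first sentence
```

Counting corrections, as done here, is one simple way to see how much manual effort remains after the automatic step.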
Some treebanks follow a particular linguistic theory in their syntactic annotation (e.g. the BulTreeBank, which follows HPSG), but most are less theory-specific. Nevertheless, two main groups can be distinguished: treebanks that annotate phrase structure (e.g. the Penn Treebank or ICE-GB) and those that annotate dependency structure (e.g. the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).
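The difference between the two groups can be made concrete by encoding one sentence both ways. The sketch below is illustrative only: the category labels (`S`, `NP`, `VP`), the relation names (`det`, `nsubj`, `root`), and the helper functions are assumptions chosen for the example and do not reproduce the annotation scheme of any of the treebanks named above.

```python
# Sketch: the same sentence under the two annotation styles.
# Labels and relation names are illustrative assumptions.

# Phrase structure: nested constituents with the words at the leaves.
phrase_structure = ("S", [
    ("NP", [("DT", ["the"]), ("NN", ["dog"])]),
    ("VP", [("VBZ", ["barks"])]),
])

# Dependency structure: one entry per word, (word, head position, relation).
# Positions are 1-based; head position 0 marks the root of the sentence.
dependencies = [
    ("the", 2, "det"),
    ("dog", 3, "nsubj"),
    ("barks", 0, "root"),
]


def constituent_labels(tree):
    """List all node labels of a phrase-structure tree, top-down."""
    label, children = tree
    labels = [label]
    for child in children:
        if not isinstance(child, str):
            labels.extend(constituent_labels(child))
    return labels


def root_word(deps):
    """Return the single word whose head position is 0."""
    roots = [word for word, head, _rel in deps if head == 0]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def dependents_of(deps, head_position):
    """Return the words directly attached to the word at head_position."""
    return [word for word, head, _rel in deps if head == head_position]


print(constituent_labels(phrase_structure))
print(root_word(dependencies))
print(dependents_of(dependencies, 3))
```

In the phrase-structure encoding the tree contains extra nodes for phrases such as `NP` and `VP`, whereas the dependency encoding has exactly one node per word and records only which word each word depends on.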
Literature
- Werner Kallmeyer, Gisela Zifonun (eds.): Language corpora - amount of data and progress in knowledge. Walter de Gruyter GmbH & Co KG, Berlin 2007, ISBN 978-3-11-019273-5.
Web links
- Annotation guidelines for the Deutsche Diachronen Baumbank (accessed on October 8, 2015)