Treebank (linguistics)

from Wikipedia, the free encyclopedia
Figure: Example tree for "John loves Mary"
Figure: Hybrid constituency/dependency tree from the Quranic Arabic Corpus

A treebank, also called a parsed corpus, is a text corpus in which every sentence is parsed, i.e. annotated with a syntactic structure. The term treebank reflects the fact that this syntactic structure is usually represented as a tree.
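As a minimal sketch of what such an annotation can look like, the example sentence "John loves Mary" can be written in a bracketed notation of the kind used by phrase-structure treebanks and read into a nested tree. The category labels (S, NP, VP, NNP, VBZ) and the helper names tokenize, parse_bracketed and leaves are assumptions chosen for illustration, not the format of any particular treebank.

```python
def tokenize(text):
    # Split a bracketed string into "(", ")" and label/word tokens.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = tokenize(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1                      # consume "("
        label = tokens[pos]           # category label, e.g. "NP"
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())        # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in their original order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical annotation of the example sentence.
SENTENCE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))"
tree = parse_bracketed(SENTENCE)
print(leaves(tree))
```

Reading the leaves back in order recovers the original sentence, while the nesting records which words form a constituent.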

Treebanks are often built on corpora that have already been annotated with part-of-speech tags. In addition, treebanks are sometimes enriched with semantic or other linguistic information.

Treebanks can be created manually, with linguists annotating each sentence with a syntactic structure, or semi-automatically, with a parser assigning a syntactic structure that a linguist then checks and, where necessary, corrects. In practice, fully parsing and checking natural-language text is labor-intensive.

Some treebanks follow a particular linguistic theory in their syntactic annotation (e.g. the BulTreeBank, which uses HPSG), but most are less theory-specific. Nevertheless, two main groups can be distinguished: treebanks that annotate phrase structure (e.g. the Penn Treebank or ICE-GB) and those that annotate dependency structure (e.g. the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).
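The difference between the two groups can be sketched by encoding the same sentence both ways. This is an illustrative sketch only: the phrase labels, the relation names (nsubj, obj, root) and the helper functions are assumptions, and real treebanks each define their own label sets and file formats.

```python
# Phrase-structure view: nested (label, children) tuples group words
# into constituents.
phrase_structure = (
    "S",
    [
        ("NP", ["John"]),
        ("VP", ["loves", ("NP", ["Mary"])]),
    ],
)

# Dependency view: one row per word, (index, word, head index, relation).
# Head index 0 marks the root of the sentence.
dependencies = [
    (1, "John", 2, "nsubj"),
    (2, "loves", 0, "root"),
    (3, "Mary", 2, "obj"),
]


def phrase_labels(tree):
    """Collect phrase labels in depth-first order."""
    label, children = tree
    labels = [label]
    for child in children:
        if not isinstance(child, str):
            labels.extend(phrase_labels(child))
    return labels


def dependents_of(rows, head_index):
    """Return the words directly governed by the word at head_index."""
    return [word for _, word, head, _ in rows if head == head_index]


root = next(word for _, word, head, _ in dependencies if head == 0)
print(phrase_labels(phrase_structure))
print(root, dependents_of(dependencies, 2))
```

The phrase-structure encoding introduces extra nodes for phrases, whereas the dependency encoding has exactly one entry per word and links each word directly to its head.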

Literature

  • Werner Kallmeyer, Gisela Zifonun (eds.): Language corpora - amount of data and progress in knowledge. Walter de Gruyter GmbH & Co KG, Berlin 2007, ISBN 978-3-11-019273-5.
