Part-of-speech tagging

Part-of-speech tagging (POS tagging) is the assignment of the words and punctuation marks of a text to parts of speech. For this purpose, both the definition of the word and its context (e.g. adjacent adjectives or nouns) are taken into account.

Procedure

The identification and labeling of parts of speech was originally carried out manually, but over time the process has been increasingly automated by computational linguistics. The methods used can be divided into supervised machine learning and unsupervised machine learning. Supervised learning uses, for example, Hidden Markov Models, Eric Brill's method, or decision trees (after Helmut Schmid), and all part-of-speech tags come from a predefined tag set. POS tagging is language dependent. The Stuttgart-Tübingen-Tagset (STTS) is often used for German. In unsupervised learning, the tag set is not fixed in advance; it is created through a stochastic process.

Principle

The German sentence Petra liest einen langen Roman. ("Petra reads a long novel.") is tagged with the Stuttgart-Tübingen-Tagset (STTS for short) as follows:

Petra/NE liest/VVFIN einen/ART langen/ADJA Roman/NN ./$.

Each word or punctuation mark is followed by a slash and its tag. To tag the word einen (here the accusative indefinite article, "a") correctly in the given context, it has to be distinguished from the forms of the identically spelled verb einen ("to unite"); these would be tagged with VVINF (for the infinitive) or VVFIN (for the finite form).
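The word/tag notation can be read back into structured data with a few lines of code. The following is only a minimal sketch: the function name parse_tagged is hypothetical, and the input is the German example sentence (Petra liest einen langen Roman.) in slash notation without surrounding spaces.

```python
def parse_tagged(text):
    """Split a string of 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits at the LAST slash, so the token './$.' yields
        # the word '.' and the tag '$.'.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("Petra/NE liest/VVFIN einen/ART langen/ADJA Roman/NN ./$.")
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Splitting at the last slash rather than the first is a deliberate choice here, since some STTS tags for punctuation (such as $.) contain characters that could otherwise be confused with the token itself.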

In supervised learning, the tag for einen (the German word for "a" in the example sentence) is selected with the help of the context: from an already tagged text corpus, for example, the probabilities of the tag sequences VVFIN-ART, VVFIN-VVINF and VVFIN-VVFIN are calculated (the so-called training of the tagger). Since VVFIN-ART is significantly more frequent than the other two sequences, einen is tagged as ART in this sentence. (The frequent sequence kann lesen, "can read", is tagged not with VVFIN-VVINF but with VMFIN-VVINF.)
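The training step described above can be illustrated by counting tag bigrams in a tagged corpus. This is a rough sketch under stated assumptions: the four-sentence corpus is invented toy data, the names corpus, transition_prob and best are hypothetical, and only tag-to-tag transition probabilities are used, whereas a real tagger such as a Hidden Markov Model would also take word-to-tag probabilities into account.

```python
from collections import Counter

# Hypothetical, hand-made training corpus (toy data, not a real treebank).
corpus = [
    [("Petra", "NE"), ("liest", "VVFIN"), ("einen", "ART"), ("Roman", "NN"), (".", "$.")],
    [("Paul", "NE"), ("kauft", "VVFIN"), ("den", "ART"), ("Roman", "NN"), (".", "$.")],
    [("Petra", "NE"), ("sieht", "VVFIN"), ("einen", "ART"), ("Film", "NN"), (".", "$.")],
    [("Paul", "NE"), ("geht", "VVFIN"), ("lesen", "VVINF"), (".", "$.")],
]

bigrams = Counter()    # counts of (previous tag, next tag)
preceding = Counter()  # counts of tags that have a successor

for sentence in corpus:
    tags = [tag for _, tag in sentence]
    preceding.update(tags[:-1])
    bigrams.update(zip(tags, tags[1:]))


def transition_prob(prev_tag, next_tag):
    """Relative frequency of next_tag directly following prev_tag."""
    if preceding[prev_tag] == 0:
        return 0.0
    return bigrams[(prev_tag, next_tag)] / preceding[prev_tag]


# Candidate tags for 'einen' after the finite verb 'liest' (VVFIN).
candidates = ["ART", "VVINF", "VVFIN"]
best = max(candidates, key=lambda tag: transition_prob("VVFIN", tag))

for tag in candidates:
    print(f"P({tag} | VVFIN) = {transition_prob('VVFIN', tag):.2f}")
print("chosen tag:", best)
```

In this toy corpus the sequence VVFIN-ART occurs three times out of four, so ART is chosen, mirroring the reasoning in the text.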

With unsupervised learning there is no previous training; instead, it is calculated from the sentences to be tagged themselves that, for example, the German word einen ("a" as an article, "to unite" as a verb) often follows liest ("reads") or lesen ("to read"), but also often stands at the end of a sentence. The article den ("the"), on the other hand, often follows liest or lesen, but never or rarely stands at the end of a sentence. Lesen often stands at the end of a sentence and never follows liest or lesen. The tagger therefore generates one part of speech to which, for example, den belongs, and another that contains lesen. Einen belongs to both parts of speech. That it should be tagged like den in the given sentence follows from the same reasoning as for a tagger trained by supervised learning.
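The distributional reasoning behind the unsupervised approach can be sketched by collecting, for each word, the contexts in which it occurs. This is only an illustration and not a full tag-induction algorithm: the untagged toy sentences, the two context features, and the names context_profile and VERB_FORMS are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical untagged toy sentences (no training data, no tag set).
sentences = [
    "Petra liest einen Roman".split(),
    "Petra liest den Roman".split(),
    "Sie will lesen".split(),
    "Das wird alle einen".split(),
    "Paul will den Roman lesen".split(),
]

VERB_FORMS = {"liest", "lesen"}


def context_profile(sentences):
    """Map each word to the set of contexts it was observed in."""
    profile = defaultdict(set)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if i > 0 and sentence[i - 1] in VERB_FORMS:
                profile[word].add("after_verb")
            if i == len(sentence) - 1:
                profile[word].add("sentence_final")
    return profile


profile = context_profile(sentences)
for word in ("einen", "den", "lesen"):
    print(word, sorted(profile[word]))
```

Words with the same profile can then be grouped into one induced part of speech: den shares the "after_verb" context and lesen the "sentence_final" context, while einen shows both and is therefore ambiguous between the two groups.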

Software

Software for computational linguistics (NLP) is often capable of automated POS tagging. The NLTK software, which is geared towards the education sector, tags English-language texts with the Penn Treebank tag set by default. In addition, a tagger can be trained individually with the help of suitable text corpora.

POS tagging is language dependent. One or more tag sets can exist per language. The open-source software OpenNLP uses the STTS tag set for German texts and the Penn Treebank tag set for English texts. The PAROLE tag set, developed for 14 European languages, is also supported. OpenNLP provides a selection of pre-trained models for these different languages (German, English, Spanish, Portuguese, Danish, etc.). With the help of these models, a text corpus in one of these languages can be automatically provided with the appropriate tags.

TreeTagger is a tool developed by Helmut Schmid at the Institute for Natural Language Processing at the University of Stuttgart. It can be used to automatically add POS tags to texts in about 16 different languages. TreeTagger is probably the language-independent tool most frequently used in research in this area.

Literature

Eric Brill: A simple rule-based part-of-speech tagger. In: Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), pp. 152-155, 1992.
Eugene Charniak: Statistical Techniques for Natural Language Parsing. In: AI Magazine 18 (4), pp. 33-44, 1997.
Hans van Halteren, Jakub Zavrel, Walter Daelemans: Improving Accuracy in Word Class Tagging through the Combination of Machine Learning Systems. In: Computational Linguistics 27 (2), pp. 199-229, 2001.
Helmut Schmid: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing, 1994.

References

STTS (HU Berlin)
Complete guide for training your own POS tagger with NLTK & Scikit-Learn. In: NLP-FOR-HACKERS. August 21, 2016. Retrieved February 9, 2019.
Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz: Building a large annotated corpus of English: the Penn Treebank. University of Pennsylvania. Retrieved February 9, 2019.
CORDIS | European Commission. In: Language Engineering. Retrieved February 9, 2019.
Two-level Morphology Irish Tags. School of Computer Science and Statistics - Trinity College Dublin. Retrieved February 9, 2019.
Apache Stanbol - OpenNLP POS Tagging Engine. Retrieved February 9, 2019.
OpenNLP Tools Models. Retrieved February 9, 2019.
Helmut Schmid's homepage. In: Center for Information and Language Processing. Ludwig Maximilians University Munich. Retrieved February 10, 2019.
TreeTagger - a language independent part-of-speech tagger | Institute for Natural Language Processing | University of Stuttgart. Retrieved February 10, 2019.
Imad Zeroual, Abdelhak Lakhouaja: MulTed: A multilingual aligned and tagged parallel corpus. In: Applied Computing and Informatics. December 14, 2018, ISSN 2210-8327, doi:10.1016/j.aci.2018.12.003 (sciencedirect.com). Retrieved February 10, 2019.
