Apache OpenNLP

Basic data

Developer: Apache Software Foundation
Published: April 22, 2004 to April 14, 2012
Current version: 1.9.3 (July 24, 2020)
Operating system: Platform independent
Programming language: Java
License: Apache License 2.0
Website: opennlp.apache.org

The Apache OpenNLP library is a machine learning-based toolkit, written in the Java programming language, for processing natural language text in the field of computational linguistics, or natural language processing (NLP). It supports the most common NLP tasks, such as language identification, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP is free software, released by the Apache Software Foundation under the Apache License. The aim of the OpenNLP project is to develop a mature toolkit for the above tasks and to provide a number of ready-made models for different languages.

The included components make it possible to perform the respective language processing task, to train a model and, in many cases, to evaluate a model. Each component can be accessed through its application programming interface (API). In addition, each can be invoked from the command line (CLI), which makes experiments and training easier.
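To illustrate what evaluating a model can involve, the following minimal sketch compares predicted items with a reference annotation and computes precision, recall and the F1 score. It uses plain Java only and is not OpenNLP's own evaluation code. The class name "EvaluationSketch", the encoding of entities as strings such as "0-2:person", and the choice of metrics are assumptions made for this example; the text above does not state which measures OpenNLP reports.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: scores predicted items against a reference set. */
final class EvaluationSketch {

    /** Fraction of predicted items that also occur in the reference. */
    static double precision(Set<String> reference, Set<String> predicted) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(reference, predicted) / predicted.size();
    }

    /** Fraction of reference items that were actually predicted. */
    static double recall(Set<String> reference, Set<String> predicted) {
        if (reference.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(reference, predicted) / reference.size();
    }

    /** Harmonic mean of precision and recall; 0.0 if both are zero. */
    static double f1(Set<String> reference, Set<String> predicted) {
        double p = precision(reference, predicted);
        double r = recall(reference, predicted);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    /** Number of items contained in both sets. */
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```

With four reference items and three predicted items, two of which are correct, this gives a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.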

Details

  • Language identification: The "LanguageDetector" requires a trained model. OpenNLP provides the pre-trained model "langdetect-183.bin" for download, which can identify 103 languages.
  • Sentence detection: The "SentenceDetector" decides whether a period marks the end of a sentence or has another meaning. This also requires a trained model. OpenNLP provides models for various languages, e.g. "de-sent.bin" for sentence detection in German texts.
  • Tokenization: The tokenizer splits a character string into tokens. Tokens are usually words, punctuation marks, numbers, etc.
  • Part-of-speech tagging: OpenNLP offers a selection of pre-trained models for various languages (German, English, Spanish, Portuguese, Danish, etc.). With the help of these models, a text corpus in one of these languages can be tagged automatically with the appropriate part-of-speech tags.
  • Named entity extraction: The "TokenNameFinder" can recognize named entities and numbers in text. Recognizing entities requires a model, which depends on the language and the entity type it was trained for. The OpenNLP project offers a number of pre-trained models that were trained on various freely available corpora. They can be downloaded from the model download page.
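The sentence detection and tokenization problems described in the list can be illustrated with a small rule-based sketch in plain Java. It is not OpenNLP's implementation and does not use OpenNLP's API: OpenNLP relies on trained models, whereas this example substitutes a fixed, hypothetical abbreviation list to decide whether a period ends a sentence. The class name "SegmentationSketch" and its method names are likewise made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: rule-based stand-ins for model-based components. */
final class SegmentationSketch {

    // Hypothetical abbreviation list; a trained model learns such cases from data.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."));

    /** Ends a sentence at '.', '!' or '?' unless the word is a known abbreviation. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            boolean endsWithMark =
                    word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
            if (endsWithMark && !ABBREVIATIONS.contains(word.toLowerCase(Locale.ROOT))) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without end mark
        }
        return sentences;
    }

    /** Splits a string into runs of letters or digits and single punctuation marks. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (char c : sentence.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                run.append(c);
            } else {
                if (run.length() > 0) {
                    tokens.add(run.toString());
                    run.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c)); // punctuation becomes its own token
                }
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }
}
```

For the input "Dr. Smith arrived. He was late!", splitSentences returns the two sentences "Dr. Smith arrived." and "He was late!", because "Dr." is on the abbreviation list. Calling tokenize on "He was late!" yields the tokens "He", "was", "late" and "!". Abbreviations missing from the list would be split incorrectly, which is the kind of ambiguity a trained model is meant to handle.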
