UIMA

Basic data

Maintainer: Apache Software Foundation
Developer: IBM, now Apache Software Foundation
First released: April 30, 2010
Current version: 2.10.0 (July 24, 2017)
Operating system: platform independent
Programming language: Java / C++
Category: Data mining
License: Apache License
Website: UIMA project page

UIMA (Unstructured Information Management Architecture) is a framework for programming data mining applications, i.e. applications for knowledge extraction.

The UIMA project was started by IBM in 2005 and has been supported by Apache since October 2006. The aim of the project is to provide a standardized framework for building applications that process unstructured information, especially in natural language processing (NLP). Unstructured information can be in any format, e.g. image or audio data, but text is the most common form.

The UIMA concept is built around a pipeline. Data is first read in, then passes through several analysis and processing steps, and is finally delivered to one or more so-called consumers, which process the results, e.g. by saving them in a database. In each analysis step the data is given annotations, i.e. a defined region of the data, for example a part of the text, is assigned a label. Because the pipeline is strongly modularized into stages, the individual stages can easily be reused.
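The annotation idea can be illustrated with a minimal sketch in plain Java. The classes Span and AnnotatedText and their methods are hypothetical names chosen for this illustration and are not taken from the UIMA API. The sketch only assumes that an annotation is a typed label on a region of the text given by a begin and end offset, and that the text itself stays unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified classes for illustration; not part of the UIMA API.
final class Span {
    final String type; // label such as "Word" or "Sentence"
    final int begin;   // inclusive start offset into the text
    final int end;     // exclusive end offset into the text

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class AnnotatedText {
    final String text; // the original data, never modified
    private final List<Span> spans = new ArrayList<>();

    AnnotatedText(String text) {
        this.text = text;
    }

    // An analysis step calls this to attach a label to a region of the text.
    void annotate(String type, int begin, int end) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span outside text");
        }
        spans.add(new Span(type, begin, end));
    }

    // Later steps and consumers look up annotations by their type.
    List<Span> select(String type) {
        List<Span> result = new ArrayList<>();
        for (Span s : spans) {
            if (s.type.equals(type)) {
                result.add(s);
            }
        }
        return result;
    }

    // Returns the part of the text that an annotation covers.
    String coveredText(Span s) {
        return text.substring(s.begin, s.end);
    }
}

class AnnotationDemo {
    public static void main(String[] args) {
        AnnotatedText doc = new AnnotatedText("UIMA manages annotations.");
        doc.annotate("Word", 0, 4);
        System.out.println(doc.coveredText(doc.select("Word").get(0)));
    }
}
```

Keeping the annotations separate from the text is what lets independent stages add their own labels without interfering with each other.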

An example of a pipeline is a simple application that calculates the average number of words per sentence in a text. The first pipeline stage reads the text, e.g. from a file. The second stage iterates through the text and marks all words by finding the positions of the spaces in the text. Analogously, the third stage performs sentence recognition by setting marks from one punctuation mark to the next. These two stages are independent of each other and could therefore be swapped. The last pipeline stage then only has to divide the number of marked words by the number of marked sentences and output the result.
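The words-per-sentence example can be sketched in plain Java as follows. This is an illustration under stated assumptions, not UIMA code: Doc, Stage, WordMarker, SentenceMarker, AverageConsumer and WordsPerSentenceDemo are hypothetical names, the reading stage is reduced to wrapping a string, words are taken to be runs of non-whitespace characters, and a sentence is assumed to end at '.', '!' or '?'.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; none of them come from the UIMA API.
final class Doc {
    final String text;
    final List<int[]> words = new ArrayList<>();     // {begin, end} offsets
    final List<int[]> sentences = new ArrayList<>(); // {begin, end} offsets

    Doc(String text) {
        this.text = text;
    }
}

interface Stage {
    void process(Doc doc);
}

// Stage 2: marks words as runs of characters between whitespace.
final class WordMarker implements Stage {
    public void process(Doc doc) {
        int start = -1;
        for (int i = 0; i <= doc.text.length(); i++) {
            boolean boundary = i == doc.text.length()
                    || Character.isWhitespace(doc.text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            }
            if (boundary && start >= 0) {
                doc.words.add(new int[] {start, i});
                start = -1;
            }
        }
    }
}

// Stage 3: marks sentences from one punctuation mark to the next.
// Text after the last punctuation mark is not counted as a sentence.
final class SentenceMarker implements Stage {
    public void process(Doc doc) {
        int start = 0;
        for (int i = 0; i < doc.text.length(); i++) {
            char c = doc.text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                doc.sentences.add(new int[] {start, i + 1});
                start = i + 1;
            }
        }
    }
}

// Consumer: divides the number of word marks by the number of sentence marks.
final class AverageConsumer {
    static double wordsPerSentence(Doc doc) {
        if (doc.sentences.isEmpty()) {
            return 0.0;
        }
        return (double) doc.words.size() / doc.sentences.size();
    }
}

class WordsPerSentenceDemo {
    static double run(String text, Stage... stages) {
        Doc doc = new Doc(text); // stage 1: "reading" is just wrapping a string
        for (Stage stage : stages) {
            stage.process(doc);
        }
        return AverageConsumer.wordsPerSentence(doc);
    }

    public static void main(String[] args) {
        String text = "UIMA builds pipelines. Each stage adds annotations.";
        System.out.println(run(text, new WordMarker(), new SentenceMarker()));
    }
}
```

Because the two marking stages write to separate lists and do not read each other's results, passing them to run in the opposite order gives the same average.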

An extension could be to count the number of verbs per sentence. For this purpose, a part-of-speech recognition stage would be added after the third stage; it would annotate each word with a label such as "verb" or "noun". Instead of counting the word annotations, the consumer would count the part-of-speech annotations labeled "verb"; all other parts of the pipeline can be reused. In this application, UIMA takes over the management of the pipeline and the internal representation of the data to be processed, including the annotations. It also offers the developer the necessary interfaces for reading in and reading out the information.
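The verb-counting extension can be sketched in the same spirit. Everything here is hypothetical and independent of the UIMA API: the class names Token, TaggedDoc and VerbPipeline are invented for the illustration, and the tagging stage is a toy that looks words up in a tiny hard-coded verb list, standing in for a real part-of-speech tagger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical names throughout; not part of the UIMA API.
final class Token {
    final int begin;
    final int end;
    String partOfSpeech = "other"; // filled in by the tagging stage

    Token(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

final class TaggedDoc {
    final String text;
    final List<Token> tokens = new ArrayList<>();
    int sentenceCount = 0;

    TaggedDoc(String text) {
        this.text = text;
    }
}

final class VerbPipeline {
    // Toy lexicon: an assumption that stands in for a real tagger.
    private static final Set<String> TOY_VERBS =
            new HashSet<>(Arrays.asList("reads", "marks", "counts", "is"));

    // Word stage: marks runs of letters as tokens.
    static void markWords(TaggedDoc doc) {
        int start = -1;
        for (int i = 0; i <= doc.text.length(); i++) {
            boolean letter = i < doc.text.length()
                    && Character.isLetter(doc.text.charAt(i));
            if (letter && start < 0) {
                start = i;
            }
            if (!letter && start >= 0) {
                doc.tokens.add(new Token(start, i));
                start = -1;
            }
        }
    }

    // Sentence stage: counts sentence-ending punctuation marks.
    static void markSentences(TaggedDoc doc) {
        for (char c : doc.text.toCharArray()) {
            if (c == '.' || c == '!' || c == '?') {
                doc.sentenceCount++;
            }
        }
    }

    // New stage: labels each token as "verb" or "other".
    static void tagPartsOfSpeech(TaggedDoc doc) {
        for (Token t : doc.tokens) {
            String word = doc.text.substring(t.begin, t.end).toLowerCase(Locale.ROOT);
            t.partOfSpeech = TOY_VERBS.contains(word) ? "verb" : "other";
        }
    }

    // Changed consumer: counts "verb" labels instead of all word marks.
    static double verbsPerSentence(TaggedDoc doc) {
        int verbs = 0;
        for (Token t : doc.tokens) {
            if (t.partOfSpeech.equals("verb")) {
                verbs++;
            }
        }
        return doc.sentenceCount == 0 ? 0.0 : (double) verbs / doc.sentenceCount;
    }

    static double run(String text) {
        TaggedDoc doc = new TaggedDoc(text);
        markWords(doc);
        markSentences(doc);
        tagPartsOfSpeech(doc); // must run after markWords
        return verbsPerSentence(doc);
    }

    public static void main(String[] args) {
        System.out.println(run("The reader reads a file. The tagger marks words and counts verbs."));
    }
}
```

Only the tagging stage and the consumer differ from a words-per-sentence pipeline; the word and sentence stages are unchanged, which is the reuse the paragraph above describes.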

UIMA is used mainly in research, but it is also increasingly becoming an industry standard. One of the best-known applications of UIMA is IBM Watson.

Web links

  1. projects.apache.org (accessed on April 8, 2020).