Information extraction

Information extraction (IE) denotes the engineering application of techniques from practical computer science, artificial intelligence and computational linguistics to the problem of automatically processing unstructured information by machine, with the aim of obtaining knowledge about a previously defined domain. A typical example is the extraction of information about company mergers (merger events), in which instances of the relation merge(company1, company2, date) are extracted from online news. Information extraction is of great importance because a great deal of information is available only in unstructured (not relationally modeled) form, for example on the Internet, and this knowledge can be better exploited through information extraction.
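In the simplest case, such relation instances can be extracted with hand-written patterns. The following toy sketch matches one fixed phrasing of a merger report with a regular expression and returns merge(company1, company2, date) tuples; the pattern, the example sentence and the company names are invented for illustration, and real IE systems rely on far more robust linguistic analysis.

```python
import re

# Toy pattern for one way a merger might be reported; real IE systems
# combine many patterns with named-entity recognition and parsing.
MERGER_PATTERN = re.compile(
    r"(?P<company1>[A-Z][\w&.\- ]+?) and (?P<company2>[A-Z][\w&.\- ]+?) "
    r"announced their merger on (?P<date>\w+ \d{1,2}, \d{4})"
)

def extract_mergers(text):
    """Return merge(company1, company2, date) tuples found in `text`."""
    return [
        (m.group("company1"), m.group("company2"), m.group("date"))
        for m in MERGER_PATTERN.finditer(text)
    ]

print(extract_mergers(
    "Alpha Corp and Beta GmbH announced their merger on May 3, 2004."
))
# [('Alpha Corp', 'Beta GmbH', 'May 3, 2004')]
```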

Information extraction

Information extraction can be viewed from two different perspectives: on the one hand as the recognition of certain information, for example when Grishman describes IE as "the automatic identification of selected types of entities, relations, or events in free text" (Grishman 2003); on the other hand as the removal of the information that is not being sought. The latter view is expressed in a definition by Cardie: "An IE system takes as input a text and 'summarizes' the text with respect to a prespecified topic or domain of interest" (Cardie 1997). In this sense, information extraction could also be described as targeted text extraction (cf. Euler 2001a, 2001b). Information extraction systems are therefore always geared towards at least one particular subject area, usually even towards specific areas of interest (scenarios) within a more general subject area (domain). In the domain 'business news', for example, a possible scenario would be 'personnel change in a management position'. Neumann adds a further restriction when he writes that the aim of IE is "the construction of systems" that "can specifically track down and structure domain-specific information from free texts [...]" (Neumann 2001, emphasis added). It should be noted that such a restriction has consequences for the technical implementation of an information extraction system.

Distinction from neighboring areas

The independent research area of information extraction must be distinguished from related areas: text extraction aims at a comprehensive summary of the content of a text (comprehensive automatic text summarization is problematic insofar as even human readers will never achieve full consistency in the task of summarizing the most important parts of a text unless it has been specified which information counts as important). Text clustering means the grouping of independent texts, text classification the assignment of texts to predefined groups. Information retrieval refers to the search for documents in a document collection (full-text search) or, taken literally, to the retrieval of information in general (cf. Strube et al. 2001). Data mining generally refers to the "process of recognizing patterns in data" (Witten 2000: 3).

Possible applications

In general, two types of application of information extraction can be distinguished. On the one hand, the extracted data can be intended directly for a human reader. Into this area of application fall the system developed by Euler (2001a) for test purposes, which forwards information extracted from e-mails as SMS messages, and systems that display information extracted from the hits of a search engine, for example the positions offered in job advertisements.

On the other hand, the data can be intended for further machine processing, be it for storage in databases, for text categorization or classification, or as a starting point for a comprehensive text extraction. If the information sought consists of several individual pieces of information, the area of application imposes certain requirements on the information extraction system: for machine processing, the information must be available in structured form, while an unstructured result may be sufficient for direct further processing by humans.

If the information sought does not consist of further individual pieces of information, as is the case with the recognition of proper names, such a distinction is superfluous.

Evaluation criteria

Information extraction systems are evaluated using the criteria of completeness and precision (recall and precision) commonly used in information retrieval, or the F-measure computed from these two values. A further criterion for assessing the quality of the extract is the proportion of undesired information (fall-out).
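In the usual contingency-table notation, with TP, FP, FN and TN denoting true positives, false positives, false negatives and true negatives, these measures are defined as follows (the balanced F1 variant of the F-measure is shown):

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad
\text{fall-out} = \frac{FP}{FP + TN}
```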

Message Understanding Conferences

The development of the relatively young research area of information extraction was driven largely by the Message Understanding Conferences (MUC). The seven MUCs were organized from 1987 to 1997 by the Defense Advanced Research Projects Agency (DARPA), the central research and development agency of the United States Department of Defense. The predefined scenarios were messages about naval operations (MUC-1 1987 and MUC-2 1989), terrorist activities (MUC-3 1991 and MUC-4 1992), joint ventures and microelectronics (MUC-5 1993), personnel changes in companies (MUC-6 1995), and spacecraft and rocket launches (MUC-7 1997) (Appelt and Israel 1999). Since a standardized output format was necessary for the joint evaluation, a common output template was used from the second MUC onwards, which is why almost all information extraction systems provide a structured output of the extracted information, with the exception of Euler (2001a, 2001b, 2002).
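To illustrate the idea of such templates, a filled record for a personnel-change scenario in the style of MUC-6 could look roughly as follows; the slot names and values are invented for this example and do not reproduce the official MUC-6 template definition.

```python
# Invented, simplified template instance for a management succession
# event; the actual MUC templates were considerably more detailed.
succession_event = {
    "organization": "Example Corp.",
    "position": "CEO",
    "person_out": "A. Smith",
    "person_in": "B. Jones",
    "reason": "retirement",
}
```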

Summary

Information extraction systems can be used for a variety of tasks, from the automatic analysis of job advertisements to the preparation of a general text extraction. Depending on these requirements, the systems can deliver structured or unstructured results. Furthermore, the systems can differ greatly in their linguistic depth, from extraction through targeted summarization (Euler 2001a, 2001b, 2002) with pure sentence filtering, where semantic orientation is given only in the form of a word list, to systems with analysis modules for all levels of language (phonology, morphology, syntax, semantics, possibly also pragmatics). In some areas, our lack of understanding of how natural language works leads to stagnation in development; but since information extraction is a more restricted task than complete text understanding, solutions appropriate to the requirements can be found in many cases, in the sense of "appropriate language engineering" (Grishman 2003), perhaps especially in connection with the neighboring areas. The method developed by Euler (2001a, 2001b, 2002) may serve as an example: in contrast to the systems dominating IE, it delivers only unstructured results, but in return it achieves high performance according to the F-measure and requires little or even no annotation effort for the training corpus, which could mean high portability to new domains and scenarios, for example in the form of word lists created en passant for a text classification.
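The sentence filtering with a word list mentioned above can be sketched minimally as follows; this is not Euler's actual procedure, merely an illustration of selecting sentences by a topic word list.

```python
import re

def filter_sentences(text, topic_words, min_hits=1):
    """Keep only sentences containing at least `min_hits` topic words.

    Deliberately simple: real systems would at least use proper
    sentence splitting, lemmatization and weighted word lists.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    topic = {w.lower() for w in topic_words}
    selected = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        if sum(t in topic for t in tokens) >= min_hits:
            selected.append(sentence)
    return " ".join(selected)

print(filter_sentences(
    "The two companies announced a merger yesterday. The weather in Berlin was sunny.",
    ["merger", "acquisition", "takeover"],
))
# "The two companies announced a merger yesterday."
```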

References

  1. Jakub Piskorski, Roman Yangarber: Information Extraction: Past, Present and Future. In: Multi-source, Multilingual Information Extraction and Summarization (= Theory and Applications of Natural Language Processing). Springer, Berlin, Heidelberg 2013, ISBN 978-3-642-28568-4, pp. 23–49, doi:10.1007/978-3-642-28569-1_2 (springer.com [accessed October 12, 2017]).

Literature

  • Appelt, Douglas; John Bear, Jerry Hobbs, David Israel, Megumi Kameyama, Mark Stickel, Mabry Tyson (1993) FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, SRI International. February 11, 2006: [1].
  • Appelt, Douglas & David Israel (1999) Introduction to Information Extraction Technology. A Tutorial Prepared for IJCAI-99 , SRI International. February 11, 2006: [2] .
  • Cardie, Claire (1997) "Empirical Methods in Information Extraction" in AI Magazine , Vol. 18, 4, 65-68. February 11, 2006: [3] .
  • Cunningham, Hamish; Diana Maynard, Kalina Bontcheva, Valentin Tablan, Cristian Ursu, Marin Dimitrov (2003) Developing Language Processing Components with GATE (a User Guide) , University of Sheffield. February 11, 2006: PDF .
  • Euler, Timm (2001a) Information Extraction by Summarizing Machine-Selected Text Segments , University of Dortmund. February 11, 2006: [4] .
  • - (2001b) Extraction of information through targeted summary of texts , University of Dortmund. February 11, 2006: PDF .
  • - (2002) "Tailoring Text using Topic Words: Selection and Compression" in Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA) , IEEE Computer Society Press. February 11, 2006: PDF .
  • Grishman, Ralph; Silja Huttunen, Pasi Tapanainen, Roman Yangarber (2000) “Unsupervised Discovery of Scenario-Level Patterns for Information Extraction” in Proceedings of the Conference on Applied Natural Language Processing ANLP-NAACL2000 , Seattle. 282-289. February 11, 2006: PDF .
  • Grishman, Ralph (2003) "Information Extraction" in Mitkov, Ruslan et al., The Oxford Handbook of Computational Linguistics , Oxford University Press. 545-559.
  • Mitkov, Ruslan (2003) “Anaphora Resolution” in Mitkov, Ruslan et al., The Oxford Handbook of Computational Linguistics , Oxford University Press. 267-283.
  • Neumann, Günter (2001) “Information Extraction” in Carstensen, Kai-Uwe et al. Computational Linguistics and Language Technology. An Introduction, Heidelberg, Berlin: Spektrum. 448-455.
  • Portmann, Edy (2008) Extraction of information from weblogs: Basics and possible uses of targeted information searches , Saarbrücken: VDM.
  • Strube, Gerhard et al. (Ed.) (2001) Digital dictionary of cognitive science, Klett-Cotta.
  • Witten, Ian & Eibe Frank (2000) Data Mining - Practical Tools and Techniques for Machine Learning , Hanser.
  • Xu, Feiyu; Hans Uszkoreit; Hong Li (2006) "Automatic Event and Relation Detection with Seeds of Varying Complexity", In Proceedings of AAAI 2006 Workshop Event Extraction and Synthesis, Boston, July, 2006.
  • Xu, Feiyu; Hans Uszkoreit; Hong Li (2007) "A Seed-driven Bottom-up Machine Learning Framework for Extracting Relations of Various Complexity", In Proceedings of ACL 2007, Prague, June, 2007. ( PDF ).
