Wrapper (information extraction)

In information extraction, a subfield of computer science, a wrapper is a set of specialised procedures for automatically extracting (semi-)structured data from a particular data source (text). Different wrappers are required depending on the type of records to be extracted. In connection with feature subset selection (FSS), the term also covers approaches for selecting the best possible subset of features from a data set.

General

LR wrapper

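An LR wrapper consists of K delimiter pairs (l_1, r_1), ..., (l_K, r_K), one pair of left and right delimiters for each attribute of a tuple. It is executed as follows:

foreach (l_k, r_k) in {(l_1, r_1), ..., (l_K, r_K)}

    find the next occurrence of l_k
    find the next occurrence of r_k
    extract the text in between and save it as the k-th value of the tuple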
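
The loop above can be sketched in Python. This is a minimal illustration, not the implementation from the cited source: the function name `exec_lr`, the sample page and the HTML delimiters are assumptions chosen for the example, and the sketch simply stops when the first left delimiter no longer occurs.

```python
from typing import List, Tuple


def exec_lr(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples from `page` with one (left, right) delimiter pair per attribute."""
    if not delimiters:
        return []
    if any(not left or not right for left, right in delimiters):
        raise ValueError("delimiters must be non-empty strings")

    tuples: List[Tuple[str, ...]] = []
    pos = 0
    # A new tuple starts as long as the first left delimiter still occurs.
    while page.find(delimiters[0][0], pos) != -1:
        values = []
        for left, right in delimiters:
            start = page.find(left, pos)       # next occurrence of the left delimiter
            if start == -1:
                return tuples                  # incomplete tuple: stop extracting
            start += len(left)
            end = page.find(right, start)      # next occurrence of the right delimiter
            if end == -1:
                return tuples
            values.append(page[start:end])     # text in between is the k-th value
            pos = end                          # continue scanning from here
        tuples.append(tuple(values))
    return tuples


# Hypothetical sample page with two attributes per tuple (name, code).
PAGE = "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
DELIMITERS = [("<B>", "</B>"), ("<I>", "</I>")]
print(exec_lr(PAGE, DELIMITERS))  # [('Congo', '242'), ('Egypt', '20')]
```

The sketch relies only on `str.find`, so it inherits the weakness described under the restrictions: any stray occurrence of a delimiter elsewhere on the page yields wrong tuples.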

Restrictions:

  • Each left delimiter must be a "proper" suffix of the text preceding every instance of the target object. "Proper" means that it must appear in front of every instance and must not appear anywhere else. Otherwise wrong tuples will be extracted.
  • Each right delimiter must be a prefix of the text following every instance of the target object and must not occur inside the instance itself. Otherwise the extraction will be terminated prematurely.

Source: [1]

More wrappers

HLRT wrappers (Head-Left-Right-Tail wrappers)
Learn additional delimiters for the head and the tail of a document. All occurrences of the left and right delimiters before the head and after the tail are ignored.
OCLR and HOCLRT wrappers
Learn an additional pair of delimiters (open and close) that marks the beginning and end of each tuple.
N-LR and N-HLRT wrappers
Allow multi-valued and optional attributes.
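
The head and tail idea behind HLRT wrappers can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the names `exec_lr` and `exec_hlrt`, the sample page and the delimiters `<P>` and `<HR>` are hypothetical, and the body is simply cut out between the first head delimiter and the first tail delimiter after it before the left/right pairs are applied.

```python
from typing import List, Tuple

Pairs = List[Tuple[str, str]]


def exec_lr(page: str, delimiters: Pairs) -> List[Tuple[str, ...]]:
    """Plain left/right extraction; delimiters are assumed to be non-empty."""
    tuples: List[Tuple[str, ...]] = []
    pos = 0
    while delimiters and page.find(delimiters[0][0], pos) != -1:
        values = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            values.append(page[start:end])
            pos = end
        tuples.append(tuple(values))
    return tuples


def exec_hlrt(page: str, head: str, tail: str, delimiters: Pairs) -> List[Tuple[str, ...]]:
    """Ignore everything before `head` and after `tail`, then extract as usual."""
    h = page.find(head)
    if h == -1:
        return []                      # no head delimiter: nothing to extract
    body_start = h + len(head)
    t = page.find(tail, body_start)
    body_end = t if t != -1 else len(page)
    return exec_lr(page[body_start:body_end], delimiters)


# Hypothetical page: the heading and the footer reuse the <B> delimiter.
PAGE = ("<B>Some Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B>")
PAIRS = [("<B>", "</B>"), ("<I>", "</I>")]

print(exec_lr(PAGE, PAIRS))                  # [('Some Country Codes', '242'), ('Egypt', '20')]
print(exec_hlrt(PAGE, "<P>", "<HR>", PAIRS)) # [('Congo', '242'), ('Egypt', '20')]
```

The first call shows how the plain left/right scheme is misled by the heading, while restricting the search to the region between head and tail recovers both tuples.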

Source: [1]

Wrapper and FSS

In feature subset selection (FSS), the following simple search strategies are available for selecting features:

Forward selection
Start with an empty set of features and always add the feature that increases the accuracy the most, until the accuracy no longer increases significantly.
Backward elimination
Start with all features and repeatedly remove the least useful one, as long as the accuracy does not decrease significantly.
Simple heuristic search
Add one feature at a time until the accuracy no longer increases significantly.
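
Forward selection and backward elimination can be sketched as greedy searches over feature subsets. The sketch below is only illustrative: the function names, the thresholds `min_gain` and `max_loss`, and the toy scoring function `toy_accuracy` are assumptions. In practice the scoring function would typically be the measured accuracy of a learning algorithm trained on the given subset.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Scorer = Callable[[FrozenSet[str]], float]


def forward_selection(features: Iterable[str], evaluate: Scorer,
                      min_gain: float = 1e-3) -> Tuple[Set[str], float]:
    """Start empty; keep adding the feature that raises the score the most."""
    selected: Set[str] = set()
    remaining = set(features)
    best = evaluate(frozenset(selected))
    while remaining:
        scored = [(evaluate(frozenset(selected | {f})), f) for f in sorted(remaining)]
        score, feature = max(scored, key=lambda s: s[0])
        if score - best < min_gain:        # no significant improvement left
            break
        selected.add(feature)
        remaining.remove(feature)
        best = score
    return selected, best


def backward_elimination(features: Iterable[str], evaluate: Scorer,
                         max_loss: float = 1e-9) -> Tuple[Set[str], float]:
    """Start with everything; drop features while the score does not suffer."""
    selected = set(features)
    best = evaluate(frozenset(selected))
    while len(selected) > 1:
        scored = [(evaluate(frozenset(selected - {f})), f) for f in sorted(selected)]
        score, feature = max(scored, key=lambda s: s[0])
        if best - score > max_loss:        # every removal hurts: stop
            break
        selected.remove(feature)
        best = score
    return selected, best


# Hypothetical scorer: "a" and "b" help, every other feature costs accuracy.
USEFUL = {"a": 0.3, "b": 0.2}


def toy_accuracy(subset: FrozenSet[str]) -> float:
    return 0.5 + sum(USEFUL.get(f, -0.05) for f in subset)


print(sorted(forward_selection("abcd", toy_accuracy)[0]))     # ['a', 'b']
print(sorted(backward_elimination("abcd", toy_accuracy)[0]))  # ['a', 'b']
```

Both searches are greedy, so they evaluate only a small part of all possible subsets and may miss the best one when features are useful only in combination.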

Restrictions

  • Permutations of attributes are not possible: the attributes must always appear in the same order.
  • The delimiter pairs may not be sufficient to identify the texts to be extracted.

To overcome these restrictions, other information extraction algorithms must be used, for example a non-deterministic, adaptive Mealy automaton (such as SoftMealy [2]), which is not subject to them.

References

  1. Nicholas Kushmerick: Wrapper Induction: Efficiency and Expressiveness. In: Artificial Intelligence. Volume 118, 2000, pp. 15-68.
  2. C.-N. Hsu, M.-T. Dung: Wrapping semistructured web pages with finite-state transducers. In: Proc. Conference on Automatic Learning and Discovery (CONALD-98). 1998.

Literature

  • B. Chidlovskii, U. Borghoff, P. Chevalier: Towards sophisticated wrapping of web-based information repositories. In: Proceedings of the Conference on Computer-Assisted Information Retrieval. 1997, pp. 123-155.
  • M. Roth, P. Schwarz: Don't scrap it, wrap it! In: Proceedings of the 23rd VLDB Conference. 1997, pp. 266-275.