Wrapper (information extraction)


In information extraction, a subfield of computer science, a wrapper is a set of specialized procedures for automatically extracting (semi-)structured data from a particular data source (text). Different wrappers are required depending on the kind of records to be extracted. In the context of feature subset selection (FSS), there are likewise different approaches for selecting a suitable subset of features from a data set.

General

LR wrapper

An LR wrapper consists of $n$ delimiter pairs $\langle l_i, r_i \rangle$.

foreach $\langle l_i, r_i \rangle \in \{\langle l_1, r_1 \rangle, \dotsc, \langle l_n, r_n \rangle\}$
    find the next occurrence of $l_i$
    find the next occurrence of $r_i$
    extract the text in between and save it as the $i$-th value of the tuple
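The extraction loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `lr_extract`, the sample page, and the choice to repeat the loop until no further left delimiter is found are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def lr_extract(text: str, delimiters: Sequence[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples from `text` using n (left, right) delimiter pairs.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if not delimiters or any(not left or not right for left, right in delimiters):
        raise ValueError("at least one pair is required and delimiters must be non-empty")

    tuples: List[Tuple[str, ...]] = []
    pos = 0  # current scan position in the text
    while True:
        values = []
        for left, right in delimiters:
            # Find the next occurrence of l_i.
            start = text.find(left, pos)
            if start == -1:
                return tuples  # no further instance: stop
            start += len(left)
            # Find the next occurrence of r_i.
            end = text.find(right, start)
            if end == -1:
                return tuples  # unterminated value: stop
            # The text in between is the i-th value of the tuple.
            values.append(text[start:end])
            pos = end + len(right)
        tuples.append(tuple(values))


# Assumed sample page: country names in <b>...</b>, dialing codes in <i>...</i>.
page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
# prints [('Congo', '242'), ('Egypt', '20')]
```

The sketch stops as soon as a delimiter can no longer be found, so an incomplete trailing tuple is discarded rather than returned.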

Restrictions:

Each $l_i$ must be a "proper" suffix of the text before each instance of the target object. Proper means that it must appear in front of every instance and must not appear anywhere else. Otherwise wrong tuples will be extracted.

Each $r_i$ must be a prefix of the text after each instance of the target object. Otherwise the extraction will be terminated prematurely.

Source: Kushmerick (2000).

More wrappers

HLRT wrappers (head-left-right-tail wrappers): Additionally learn one delimiter for the head and one for the tail of a document. All occurrences of $\langle l_i, r_i \rangle$ before the head and after the tail are ignored.

OCLR and HOCLRT wrappers: Additionally learn a pair of delimiters that marks the opening and closing of each tuple.

N-LR and N-HLRT wrappers: Allow multi-valued and optional attributes.
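As an illustration of the HLRT idea described above, the following Python sketch restricts an LR-style scan to the region between a head and a tail delimiter. It is a simplified sketch under assumptions: the name `hlrt_extract`, the sample page, and the decision to bound the search by the tail position are choices made for this example and are not taken from the cited source.

```python
from typing import List, Sequence, Tuple


def hlrt_extract(
    text: str,
    head: str,
    tail: str,
    delimiters: Sequence[Tuple[str, str]],
) -> List[Tuple[str, ...]]:
    """LR-style extraction limited to the text between `head` and `tail`.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if not head or not tail or not delimiters:
        raise ValueError("head, tail and at least one delimiter pair are required")
    if any(not left or not right for left, right in delimiters):
        raise ValueError("delimiters must be non-empty")

    # Skip everything up to and including the head delimiter.
    head_pos = text.find(head)
    if head_pos == -1:
        return []
    body_start = head_pos + len(head)

    # Ignore everything from the tail delimiter onwards.
    tail_pos = text.find(tail, body_start)
    body_end = tail_pos if tail_pos != -1 else len(text)

    tuples: List[Tuple[str, ...]] = []
    pos = body_start
    while True:
        values = []
        for left, right in delimiters:
            start = text.find(left, pos, body_end)
            if start == -1:
                return tuples
            start += len(left)
            end = text.find(right, start, body_end)
            if end == -1:
                return tuples
            values.append(text[start:end])
            pos = end + len(right)
        tuples.append(tuple(values))


# Assumed sample page: <b> also occurs in the header and footer,
# which a plain LR scan over the whole page would pick up as well.
page = (
    "<html><b>Menu</b><p>"
    "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
    "<hr><b>End</b></html>"
)
result = hlrt_extract(page, "<p>", "<hr>", [("<b>", "</b>"), ("<i>", "</i>")])
print(result)
```

Bounding the search with the tail position keeps delimiter matches in the page footer from producing spurious tuples, which is the motivation given for learning head and tail delimiters.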

Source: Kushmerick (2000).

Wrapper and FSS

The following simple strategies are available for feature subset selection:

Forward selection: Start with an empty set of features and always add the feature that increases the accuracy the most, until the accuracy no longer increases significantly.
Backward elimination: Start with all features and try to remove unsuitable ones.
Simple heuristic search: Add one feature at a time until the accuracy no longer increases significantly.
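Forward selection, the first strategy listed above, can be sketched in Python as follows. This is a minimal sketch under assumptions: `forward_selection`, `min_gain`, `baseline`, and the toy scoring function `toy_accuracy` are hypothetical names introduced for illustration. In practice `evaluate` would train and validate a model on the given feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    min_gain: float = 0.01,
    baseline: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily add the feature that increases accuracy the most.

    Stops when the best achievable gain drops below `min_gain`
    (an assumed threshold for "no longer increases significantly").
    """
    selected: List[str] = []
    remaining = list(features)
    best_score = baseline
    while remaining:
        # Score every candidate subset formed by adding one more feature.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best = max(scored, key=lambda pair: pair[0])
        if score - best_score < min_gain:
            break  # no significant improvement left
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected, best_score


def toy_accuracy(subset: List[str]) -> float:
    """Made-up accuracy: each feature contributes a fixed, assumed gain."""
    gains = {"a": 0.25, "b": 0.15, "c": 0.002}
    return 0.5 + sum(gains.get(f, 0.0) for f in set(subset))


chosen, accuracy = forward_selection(["c", "b", "a"], toy_accuracy, baseline=0.5)
print(chosen, accuracy)
```

With the made-up gains, the features `a` and `b` are added in that order and `c` is rejected because its gain is below the threshold. Backward elimination would mirror this loop, starting from the full set and removing one feature per round.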

Restrictions

No permutations of attributes possible
The delimiter pairs may not be sufficient to identify the target texts

To overcome these restrictions, other information extraction algorithms must be used, such as a non-deterministic, adaptive Mealy automaton (e.g. SoftMealy; Hsu and Dung, 1998), which is not subject to them.

References

Nicholas Kushmerick: Wrapper Induction: Efficiency and Expressiveness. In: Artificial Intelligence. Volume 118, 2000, pp. 15-68.
C.-N. Hsu, M.-T. Dung: Wrapping semistructured web pages with finite-state transducers. In: Proc. Conference on Automatic Learning and Discovery (CONALD-98). 1998.

Literature

B. Chidlovskii, U. Borghoff, P. Chevalier: Towards sophisticated wrapping of web-based information repositories. In: Proceedings of the Conference on Computer-Assisted Information Retrieval. 1997, pp. 123-155.
M. Roth, P. Schwartz: Don't scrap it, wrap it! In: Proceedings of the 22nd VLDB Conference. 1997, pp. 266-275.
