Wrapper (information extraction)
In information extraction, a subfield of computer science, a wrapper is a set of specialized procedures for automatically extracting (semi-)structured data from a particular data source (text). Different wrappers are required depending on the type of data records to be extracted. In connection with feature subset selection (FSS), there are also wrapper approaches for selecting a suitable subset of features from a data set.
General
LR wrapper
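An LR wrapper consists of K delimiter pairs ⟨l_1, r_1⟩, ..., ⟨l_K, r_K⟩, one pair for each attribute of the tuple to be extracted.
For each pair ⟨l_k, r_k⟩, k = 1, ..., K:
- find the next occurrence of l_k
- find the next occurrence of r_k
- extract the text in between and save it as the k-th value of the tuple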
Restrictions:
- Each l_k must be a "proper" suffix of the text before each instance of the target object. "Proper" means that it must appear before every instance and must not appear anywhere else. Otherwise, wrong tuples will be extracted.
- Each r_k must be a prefix of the text after each instance of the target object. Otherwise, the extraction will terminate prematurely.
Source: Kushmerick (2000).
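The extraction loop of an LR wrapper can be sketched in Python as follows. This is a minimal illustration, not Kushmerick's original formulation: the function name `lr_extract`, the sample page, and the decision to continue searching after the right delimiter are assumptions made for this example.

```python
from typing import List, Tuple


def lr_extract(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples from `page` using one (left, right) delimiter pair per attribute."""
    rows: List[Tuple[str, ...]] = []
    pos = 0  # current read position in the page
    while True:
        values = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further left delimiter: stop (partial tuples are dropped)
            start += len(left)
            stop = page.find(right, start)
            if stop == -1:
                return rows  # right delimiter missing: extraction ends here
            values.append(page[start:stop])
            pos = stop + len(right)  # assumption: continue after the right delimiter
        rows.append(tuple(values))


# Hypothetical sample page with two attributes per tuple (country, code).
page = ("<HTML><BODY><B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR></BODY></HTML>")
delimiters = [("<B>", "</B>"), ("<I>", "</I>")]
print(lr_extract(page, delimiters))
```

If a left delimiter also occurs outside the data region (for example a bold page title), this sketch extracts wrong tuples, which is the situation the first restriction above describes.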
More wrappers
- HLRT wrappers (head-left-right-tail wrappers)
  - Learn additional delimiters for the head and the tail of a document. All occurrences of the delimiters l_k and r_k before the head and after the tail are ignored.
- OCLR and HOCLRT wrappers
  - Learn an additional pair of delimiters (open and close) that encloses each tuple.
- N-LR and N-HLRT wrappers
  - Allow multi-valued and optional attributes.
Source: Kushmerick (2000).
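The idea behind the HLRT wrapper can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the data region is taken to end at the first tail delimiter after the head, and the names `hlrt_extract`, `head`, and `tail` as well as the sample page are hypothetical.

```python
from typing import List, Tuple


def hlrt_extract(page: str, head: str, tail: str,
                 delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """LR-style extraction restricted to the region between `head` and `tail`."""
    head_at = page.find(head)
    if head_at == -1:
        return []  # no head delimiter: nothing to extract
    pos = head_at + len(head)  # skip everything up to and including the head
    # Assumption: the first tail delimiter after the head ends the data region.
    tail_at = page.find(tail, pos)
    end = tail_at if tail_at != -1 else len(page)

    rows: List[Tuple[str, ...]] = []
    while True:
        values = []
        for left, right in delimiters:
            start = page.find(left, pos, end)  # search only inside the data region
            if start == -1:
                return rows
            start += len(left)
            stop = page.find(right, start, end)
            if stop == -1:
                return rows
            values.append(page[start:stop])
            pos = stop + len(right)
        rows.append(tuple(values))


# Hypothetical page: bold text also appears in the title and after the data.
page = ("<HTML><TITLE><B>Country codes</B></TITLE><P>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></HTML>")
print(hlrt_extract(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
```

Because the search is confined to the region between head and tail, the bold title and the bold footer text do not produce spurious tuples.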
Wrapper and FSS
The following simple search strategies are available for feature subset selection:
- Forward selection
  - Start with an empty set of features and always add the feature that increases the accuracy the most, until the accuracy no longer increases significantly.
- Backward elimination
  - Start with all features and try to remove unsuitable ones one at a time.
- Simple heuristic search
  - Add one feature at a time until the accuracy no longer increases significantly.
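Forward selection, as described in the list above, can be sketched in Python as follows. The evaluation function `score` (for instance, the cross-validated accuracy of a classifier trained on the given features), the threshold `min_gain`, and the toy scoring function are assumptions made for this illustration and are not part of the source.

```python
from typing import Callable, FrozenSet, Iterable, Set


def forward_selection(features: Iterable[str],
                      score: Callable[[FrozenSet[str]], float],
                      min_gain: float = 1e-3) -> Set[str]:
    """Greedily add the feature that improves `score` the most."""
    selected: Set[str] = set()
    remaining = set(features)
    best = score(frozenset(selected))  # baseline: empty feature set
    while remaining:
        # Evaluate every candidate extension of the current subset.
        candidate, cand_score = max(
            ((f, score(frozenset(selected | {f}))) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if cand_score - best < min_gain:
            break  # accuracy no longer increases significantly
        selected.add(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected


# Toy stand-in for a real accuracy estimate: two useful features and a
# small penalty per selected feature.
USEFUL = {"a": 0.3, "b": 0.2}


def toy_score(subset: FrozenSet[str]) -> float:
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


print(forward_selection(["a", "b", "c"], toy_score))
```

Backward elimination follows the same pattern in reverse: it starts from the full feature set and removes the feature whose removal hurts the score the least.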
Restrictions
- The attributes must always appear in the same order; permutations of attributes are not possible.
- The delimiter pairs may not be sufficient to identify the target texts.
To overcome these restrictions, other information extraction algorithms must be used, such as a non-deterministic, adaptive Mealy automaton (e.g. SoftMealy), which is not subject to them.
References
- Nicholas Kushmerick: Wrapper Induction: Efficiency and Expressiveness. In: Artificial Intelligence. Volume 118, 2000, pp. 15-68.
- C.-N. Hsu, M.-T. Dung: Wrapping semistructured web pages with finite-state transducers. In: Proc. Conference on Automatic Learning and Discovery (CONALD-98). 1998.
Literature
- B. Chidlovskii, U. Borghoff, P. Chevalier: Towards sophisticated wrapping of web-based information repositories. In: Proceedings of the Conference on Computer-Assisted Information Retrieval. 1997, pp. 123-155.
- M. Roth, P. Schwarz: Don't scrap it, wrap it! In: Proceedings of the 23rd VLDB Conference. 1997, pp. 266-275.