Schema transformation and integration

In computer science, schema transformation and integration refers to converting one schema into another (transformation) or combining several schemas into a new schema (integration). Both tasks are important in data migration and information integration (or data integration). The transformation and integration of schemas is in some respects comparable to translation between natural languages and, like translation, its difficulty is often underestimated. A concrete mapping of one or more schemas to one or more others is called a schema mapping, and the automatic detection of such a mapping is called schema matching. However, these terms are not used consistently. In information integration, a distinction is made between schema integration and schema mapping, depending on whether the data of the source schemas are to be merged completely (materialized integration) or only in response to queries (virtual integration).

Schema mapping

A schema mapping is a list of correspondences that relate equivalent components of two heterogeneous schemas to one another.

From the mapping, it should be possible to derive transformation rules with which the data of one schema can be transferred as completely as possible into the other schema. This can be done, for example, with the database language SchemaSQL. To determine the concrete transformations for a given mapping, associations within and between the schemas must be found.
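The step from correspondences to transformation rules can be illustrated with a minimal Python sketch. It does not use SchemaSQL; the attribute names, the conversion functions, and the helper `transform` are hypothetical and chosen only for illustration.

```python
# Hypothetical correspondences: (source attribute, target attribute, conversion).
# The conversion is optional and handles differing value formats.
CORRESPONDENCES = [
    ("name", "full_name", str.strip),
    ("birth_year", "year_of_birth", int),
    ("city", "town", None),
]


def transform(record, correspondences):
    """Transfer one source record into the target schema."""
    target = {}
    for source_attr, target_attr, convert in correspondences:
        if source_attr not in record:
            continue  # the source record has no value to transfer
        value = record[source_attr]
        target[target_attr] = convert(value) if convert else value
    return target


source_record = {"name": " Ada Lovelace ", "birth_year": "1815", "fax": "n/a"}
result = transform(source_record, CORRESPONDENCES)
print(result)
```

The attribute `fax` has no correspondence and is therefore lost, which is why the data can only be transferred "as completely as possible".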

Schema matching

The methods of automatically finding a mapping between two given schemas can be divided into four classes:

Label-based matching
Instance-based matching
Structure-based matching
Mixed forms of the methods above

Label-based match search

The core idea of label-based match search is to form the cross product of all attribute names of the two schemas to be compared and to determine the similarity of the names in each pair (for example, with the Levenshtein distance). The most similar pairs are then presumed to be matches.
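The procedure can be sketched in Python as follows. The normalization of the Levenshtein distance to a similarity between 0 and 1, the threshold of 0.5, and the example attribute names are assumptions made for this illustration, not part of a fixed method.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def label_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical names (ignoring case)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def label_matches(attrs_a, attrs_b, threshold=0.5):
    """Score every pair of the cross product, keep the similar ones, best first."""
    scored = [(label_similarity(a, b), a, b) for a, b in product(attrs_a, attrs_b)]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


matches = label_matches(["name", "zipcode", "city"], ["Name", "zip_code", "town"])
for score, a, b in matches:
    print(f"{a} ~ {b}: {score:.2f}")
```

Pairs such as `city` and `town` show the limit of the approach: synonyms have dissimilar labels and are not found by name comparison alone.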

Instance-based match search

Given are two schemas with the attribute sets $A$ and $B$ and the respective underlying data.

The core idea of instance-based match search is to determine characteristic properties (e.g. length, letter distribution, etc.) of the existing data for each attribute. The cross product of all attributes of the two schemas to be compared is then formed, and for each pair the similarity of these properties is computed. The most similar pairs are presumed to be matches.
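A minimal Python sketch of this idea follows. The three properties used (average value length, proportion of digits, proportion of letters), the similarity measure, and the sample data are assumptions chosen for illustration; real matchers typically use richer profiles.

```python
from itertools import product


def profile(values):
    """Characteristic properties of an attribute's values, each scaled to [0, 1]."""
    texts = [str(v) for v in values]
    chars = "".join(texts)
    n_chars = max(len(chars), 1)
    avg_len = sum(len(t) for t in texts) / max(len(texts), 1)
    return (
        min(avg_len, 50) / 50,                       # average length, capped at 50
        sum(c.isdigit() for c in chars) / n_chars,   # proportion of digits
        sum(c.isalpha() for c in chars) / n_chars,   # proportion of letters
    )


def profile_similarity(p, q):
    """1 minus the mean absolute difference of the properties."""
    return 1.0 - sum(abs(x - y) for x, y in zip(p, q)) / len(p)


def instance_matches(data_a, data_b):
    """Score every attribute pair of the cross product, best first."""
    profiles_a = {name: profile(vals) for name, vals in data_a.items()}
    profiles_b = {name: profile(vals) for name, vals in data_b.items()}
    scored = [(profile_similarity(profiles_a[a], profiles_b[b]), a, b)
              for a, b in product(profiles_a, profiles_b)]
    return sorted(scored, reverse=True)


data_a = {"phone": ["030-1234567", "040-7654321"],
          "name": ["Ada Lovelace", "Alan Turing"]}
data_b = {"tel": ["089-5550123", "0221-445566"],
          "person": ["Grace Hopper", "Edsger Dijkstra"]}

matches = instance_matches(data_a, data_b)
for score, a, b in matches:
    print(f"{a} ~ {b}: {score:.2f}")
```

Here `phone` and `tel` are paired although their labels have little in common, because their values share the same shape.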

Structure-based match search

Given are two schemas with the element sets $A$ and $B$ (elements can be attributes, relations, etc.). The core idea is to use the (complex) structure of the schemas to find matches. For example, the hierarchy level of an element, its element type, or its neighborhood relationships can be taken into account. Similarity flooding, for example, can be used to improve the results.
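The following Python sketch compares elements using only the three structural features named above. The `Element` class, the equal weighting of the features, and the example schemas are assumptions for illustration; similarity flooding is not implemented here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Element:
    name: str               # used only for display, not for matching
    kind: str               # element type, e.g. "relation", "string", "integer"
    level: int              # hierarchy level within the schema
    neighbour_kinds: tuple  # types of the directly connected elements


def jaccard(x, y):
    """Overlap of two collections, treated as sets."""
    x, y = set(x), set(y)
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def structural_similarity(e, f):
    """Average of level, type, and neighbourhood agreement (equal weights assumed)."""
    level_score = 1.0 / (1 + abs(e.level - f.level))
    kind_score = 1.0 if e.kind == f.kind else 0.0
    neighbour_score = jaccard(e.neighbour_kinds, f.neighbour_kinds)
    return (level_score + kind_score + neighbour_score) / 3


schema_a = [Element("person", "relation", 0, ("string", "integer")),
            Element("name", "string", 1, ("relation",)),
            Element("age", "integer", 1, ("relation",))]
schema_b = [Element("customer", "relation", 0, ("string", "integer")),
            Element("label", "string", 1, ("relation",)),
            Element("years", "integer", 1, ("relation",))]

scored = sorted(((structural_similarity(e, f), e.name, f.name)
                 for e, f in product(schema_a, schema_b)), reverse=True)
for score, a, b in scored:
    print(f"{a} ~ {b}: {score:.2f}")
```

Purely structural evidence cannot distinguish sibling elements of the same type, which is one reason it is usually combined with other techniques.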

Mixed forms

Among the mixed forms, a distinction is made between hybrid approaches, which use several of the basic techniques within a single matcher, and composite approaches, which run several matchers (including hybrid ones) side by side and combine their weighted results (e.g. with the help of machine learning).
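The weighted combination of several matchers can be sketched in Python as follows. The score tables and the weights are hypothetical values fixed by hand for illustration; a learning approach would instead derive the weights from known matches.

```python
def combine(score_tables, weights):
    """Weighted average of the scores of several matchers for each element pair."""
    total = sum(weights[m] for m in score_tables)
    pairs = set().union(*(table.keys() for table in score_tables.values()))
    return {
        pair: sum(weights[m] * score_tables[m].get(pair, 0.0)
                  for m in score_tables) / total
        for pair in pairs
    }


# Hypothetical outputs of two independent matchers for the same attribute pairs.
score_tables = {
    "label": {("phone", "tel"): 0.2, ("phone", "person"): 0.3},
    "instance": {("phone", "tel"): 1.0, ("phone", "person"): 0.4},
}
weights = {"label": 1.0, "instance": 3.0}  # assumed, not learned

combined = combine(score_tables, weights)
best_pair = max(combined, key=combined.get)
print(best_pair, round(combined[best_pair], 3))
```

In this example the label matcher alone would prefer the wrong pair, while the combined score favors `phone` and `tel`.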

Literature

Ulf Leser, Felix Naumann: Informationsintegration. dpunkt, 2007, ISBN 978-3-89864-400-6.