Global matching

In the context of information integration, global matching refers to a process for the automatic mapping of different schemas to one another ( schema matching ). Results from different matching processes are used to actually map attributes of the schemas to be matched to one another.

Disambiguation

Global matching tries to relate the attributes and relations of different schemas (generally two) to one another. Global matching can therefore be understood in two ways.

Global matching is a "normal" matching procedure that matches attributes (entire columns) with one another,

Global matching, on the other hand, is a step that uses the information already obtained to generate actual matches between the attributes (and relations) of two schemas. He tries to select the correct matches as possible , which together form the best possible mapping of the two schemes.

The second interpretation is dealt with below. Since global matching can refer to the results of simple as well as to the results of composite matchers , it cannot be classified in the categorization of schema matching .

Problem

The matchings between any two attributes of two different schemas are assigned a probability by previous matching processes (e.g. edit distance, (artificial) neural networks or similarity flooding ). The difficulty now is to select those pairs of attributes that actually belong together. It is generally not sufficient here to assign the attribute of the second schema with the greatest correspondence probability to each attribute of the first schema. The following problems can occur:

Several attributes of the second schema have the same similarity with regard to an attribute of the first schema. They may even be “real” semantically similar, as the following example shows.

Instead of 1: 1 matches, there are 1: n matches, for example if first name and surname are combined in a different scheme as the “Name” attribute. Exactly the right attributes (possibly even in the correct order and with the correct concatenation operator) have to be selected.

Some attributes may not be allowed to be matched because they have no equivalent in the other scheme.
The sheer combinatorial complexity of the possible pairings makes the calculation take a long time. In addition, the pairings actually found and suggested to the user make it difficult for the user to evaluate the result. This is especially true if there are many false positives among them.

Limits of global matching

Global matching aims to map two schemas to one another and to use different matching methods that have calculated / estimated the similarities between the attributes.

The solutions currently available on the market often only offer 1: 1 mappings and rely on structural or name similarities. So you do not solve all of the problems outlined above .

Workflow

Global matching workflow

The global matching workflow is shown below.

There are two schemes that are to be integrated. In practice, one schema could be the source schema from which the data is to be migrated to a target schema . This is conceivable, for example, when a company is taken over by another company. The schema of the acquired company would be the source schema, the other the target schema. The relations and attributes of interest are selected from the corresponding schemas, for example a product catalog, while customer or sales data is ignored.

In the next step, the selected attributes must be matched. For this purpose, a specific, non-empty set of matchers is selected, which determines the pairwise similarity of the selected attributes. Depending on the type and availability, the matchers use external data such as dictionaries or the type information or foreign key relationships stored in the database.

The results of the matchers are set up in a similarity matrix. If several matchers were used, the individual results must be weighted and, if necessary, normalized. The similarity matrix indicates for each attribute pair (formed from one attribute of both schemas) how well / likely both participants in the pair semantically fit together. The similarity values do not necessarily have to be in the interval [0, 1]. In general, the similarity matrix can not be converted into an identity matrix by rearranging the rows or columns ; Attributes from one scheme are therefore usually assigned several attributes with a certain similarity from the other scheme and vice versa. The similarity matrix can be improved by algorithms such as similarity flooding . A similarity matrix can also be displayed as a list, the entries of which are triples (Attribute1, Attribute2, similarity value).

In order to make the following work more efficient and to reduce the number of false positives , an intermediate step that filters the similarity matrix is conceivable before the next. For example, threshold values can be established for the similarities. Less similarities are considered too improbable and are set to 0. In addition, type information can be included if this has not already been done by the matcher. The intermediate step may take place implicitly if the combination of several matching methods was done by a combining matcher and it then performs the filtering internally.

Mappings are then obtained from the (now possibly thinned) similarity matrix. A mapping is a selection of one-to-one assignments between elements of both sets, whereby there can also be elements that are not assigned to any element of the other set. Since there are generally still no or not only 1: 1 assignments in the similarity matrix, there are several theoretically conceivable mappings. Thus, the similarity matrix generates a multi-mapping may be generated from the plurality of mappings. There are different algorithms for selecting “good” mappings. They are presented in the next section. The mappings generated by the algorithms are then presented to the user.

Algorithms

Some mapping algorithms are briefly explained below. They each try to form pairs from elements of two sets for which meta information about the probability of a match (the similarity matrix) is known.

Stable marriage

The stable marriage algorithm by Gale and Shapely tries to bring two elements together in such a way that they are connected to their counterpart that is most likely to match (i.e. the one with the greatest value), which is not itself connected to another element with an even higher probability is. In other words, there are two groups of people, men and women, with each person having a ranking list of all people of the opposite sex. The people are now married in such a way that no person is married to someone else who does not hold the highest possible position in the preference list of remaining marriage candidates.

The elements here are the small circles that each belong to the set of circles of the same color.

This algorithm is explained using the following example.

The left and right parts show the same similarity matrix.

For the algorithm, the similarity matrix has to be converted into two preference lists, one “from the view of the blue set” and one “from the view of the red set”. The numerical relationships are irrelevant, only the order. The sequence results from a zero-based counting of the element indices (A and D, for example, have index 0, C and F, however, have index 2). The preference lists look like this:

„Präferenzliste der blauen Menge“ „Präferenzliste der roten Menge“
            A: 2 1 0                                      D: 2 1 0
            B: 1 2 0                                      E: 2 1 0
            C: 0 1 2                                      F: 0 1 2

with resolution of the indices:

„Präferenzliste der blauen Menge“ „Präferenzliste der roten Menge“
            A: F E D                                      D: C B A
            B: E F D                                      E: C B A
            C: D E F                                      F: A B C

This conversion is not unambiguous; C's preference list could also have been 1 0 2 . A concrete implementation has to decide on a variant. Another indeterminism remains, since "Ms. D and Ms. E compete for the men in the same order". In other words: the order of the attributes is relevant for the result.

After the preference lists have been created, the actual algorithm can begin. In doing so, the attributes of a set are examined. The "best-fitting" attributes of the other scheme are marked as a match. It can happen that one attribute is already marked, but is at the top of the preference list for another. To resolve this conflict, the preference list of the disputed attribute is looked up to determine which of the two competing attributes is more similar. If it is the one that has already been reserved, nothing changes and an attempt is made to match the “inferior” attribute against the next one in its preference list. However, if the third attribute has a higher-value position in the preference list with regard to the attribute to be matched, this pairing is noted and the old, reserved pairing is deleted.

properties

A stable marriage mapper requires the same number of attributes in the source and target schema and maps the two schemas completely. In doing so, it creates 1: 1 mappings.

It is ambiguous when creating the preference lists and also when handling the pair formation, where the sequence is important. He selects the best possible mapping for each attribute.

Maximum Weighted Bipartite Graph Matching

The similarity matrix can also be understood as a graph that is bipartite (two types of nodes), undirected and weighted. This should be transformed into a graph, which is also incoherent, in which exactly two nodes are connected with each other. All that is required is to select “the correct” edges of the first graph and transfer them to the second graph.

The Maximum-Weighted-Bipartite-Graph-Matching-Algorithm (e.g. implemented by the Hungarian Algorithm ) tries to maximize the sum of the weights of the "activated" edges, which when applied to the marriage problem corresponds to a maximization of the overall satisfaction run the risk of below maximum satisfaction for individuals.

To do this, the values of the similarity matrix must first be converted into distance measures, for example by subtracting the respective similarity from one. Then the row minimum is subtracted from each row, then the column minimum from each column. This inevitably results in zeros. These zeros are covered by one or more imaginary bars, but with as few as possible. If the number of bars matches the number of attributes, the zero contained in them is searched for line by line. If there is only one zero, it defines the match between the two attributes. Row and column are deleted. If there are more than one, it is initially omitted and the process continues with the next line. The process is then repeated for columns.

The implementation of Brian milk is based on the Hungarian algorithm of Harold Kuhn and requires only the similarity matrix as input. It produces the following mapping:

properties

The maximum weighted bipartite graph matching algorithm allows the mapping of schemes of different sizes. However, it does enforce full mappings. In addition, the sum of the weights of the selected edges is maximized.

Royal Couples

Royal Couples was introduced by Marie and Gal as an alternative to the stable marriage algorithm. In order to avoid the constant changes in the list of reserved couples, the attributes that best match each other are simply connected to one another, the "Royal Couple". As a result, there cannot be a pairing that fits better, i.e. the reservation cannot be destroyed. After the two pair participants have been assigned to each other, they are of course no longer paired and can be deleted from the similarity matrix, including the row and column in which they were.

Viewed objectively, however, this is precisely the naive approach that selects and removes the pairs sorted according to their similarity.

properties

Royal Couples does not maximize the overall similarity of the mapping, but does create a stable marriage. In addition, all attributes are mapped and there are only 1: 1 matches.

Dominants matching

Marie and Gal present another algorithm in the same paper that alleviates a problem from Royal Couples and also the Maximum-Weighted-Bipartite-Graph-Matching-Algorithm. With these algorithms, the selection of a pair causes a row and a column of the similarity matrix to be deleted, since the two attributes can no longer be matched from now on. This also removes similarities that are also of relatively high but not maximum value. This opens up a new problem as well as a new perspective. The problem here is that the missing deletion of the rows and columns means that attributes can be matched several times. A higher authority must take over the filtering of the pairs. However, if the multiple occurrences are considered to be wanted, then m: n mappings are obtained.

The algorithm works in such a way that the row and column maxima are marked. The cells of the similarity matrix, which are row and column maximum at the same time, form the correspondences. Ultimately, this algorithm is only a softening of the Royal Couples algorithm, in which not the entire row and column of a found correspondence is deleted, but only the respective cell. However, if you relate this to multiple mappings, this means progress that would not be possible with Royal Couples alone.

properties

Dominants matching allows the mapping of schemas of different sizes. It does not force a complete mapping, since attributes that are only slightly similar to all the attributes of the other schema are not included in the mapping. They have either a row or a column maximum, but not both. m: n mappings are possible.

Royal Groups

The Royal Groups algorithm has two goals. Firstly, the precision is increased, since the number of incorrect correspondences is reduced, and secondly, clusters are formed from attributes that may belong together. These can either be m: n mappings or simple mappings. In the second case in particular, clustering is viewed as a restricted environment for correspondence alternatives. The user can then more easily select the correct correspondence.

Royal Groups is based on the Royal Couples algorithm, but introduces a threshold for similarity. Similarities under this are mentally set to 0 and prevent the corresponding pairing. This eliminates the need to create a full mapping. The advantage of this is that the number of false positives is reduced. The user is relieved because it means that he does not have to manually delete obviously incorrect correspondence.

In addition, slight differences in the similarity values are smoothed out. It seems inappropriate to absolutely prefer a similarity of 0.99 to a similarity of 0.95 when other similarities are noticeably lower. Therefore an additional quantity is introduced, that of the ε-neighborhood. If a similarity maximum is found, all the values in the same row and column are also interpreted as a maximum, the similarities of which are within the area determined by ε to the maximum. This allows the above-described clusters to be formed, called cliques.

The difficulty with Royal Groups, however, is to find suitable parameters. They depend on the matcher used. If it generates values close to 0 for incorrect correspondences and values close to 1 for correct correspondences, the threshold value could be 0.8 and the ε value 0.1. For example, if the maximum is 0.92, attributes with a similarity of 0.86 would be considered equivalent or equally likely and included in the clique.

properties

Royal Groups does not enforce a complete mapping and does not require equal sets of attributes. It can generate m: n mappings.

comparison

The named properties of the algorithms in a table:

criteria	Royal Couples	Stable marriage	MWBG	Dominants matching	Royal Groups
Compulsion to create complete mapping	Yes	Yes	Yes	No	No
Allows m: n mappings	No	No	No	not yet	Yes
Characteristic	simplicity	"Most satisfied" attribute matching	Maximizing overall satisfaction	faster, less recall	"More tolerant" degree of similarity
Best match for sure included	Yes	Yes	No	Yes	Yes
complexity	${\ displaystyle {\ mathcal {O}} (n ^ {2} \ log n)}$	${\ displaystyle {\ mathcal {O}} (n ^ {2})}$	${\ displaystyle {\ mathcal {O}} (n ^ {3})}$	${\ displaystyle {\ mathcal {O}} (n ^ {2})}$	${\ displaystyle {\ mathcal {O}} (n ^ {2} \ log n)}$

Closing remarks

Commercial programs such as the Microsoft BizTalk Mapper or the IBM Rational Data Architect do not support many-to-many mappings. The results of the algorithms used there correspond to the 1: 1 algorithms presented here. By including a threshold value, the results can be filtered and thus the precision increased. The user is then of course left with the question of which threshold value he should specify. The Dominants algorithm frees the user from this. It "hides" false correspondence by itself.

The Royal Groups algorithm, on the other hand, offers a short list of suggestions in many of these "hidden" places. The user can quickly go through them and delete unnecessary correspondence. As a "bonus", m: n mappings are also found. However, the user has to buy this by fine-tuning the threshold value and the epsilon environment, which depend on the matcher used.

Individual evidence

↑ Melnik et al .: Similarity Flooding: A Versatile Graph Matching Algorithmand its Application to Schema Matching. Retrieved August 13, 2019 .
↑ D. Gale and S. Shapely, "College Admissions and the Stability of Marriage." The American Mathematical Monthly 69.1 (Jan., 1962): 9-15.
↑ BLOG Inference Engine ( Memento from June 16, 2011 in the Internet Archive )

[1] Melnik et al .: Similarity Flooding: A Versatile Graph Matching Algorithmand its Application to Schema Matching. Retrieved August 13, 2019 .

[2] D. Gale and S. Shapely, "College Admissions and the Stability of Marriage." The American Mathematical Monthly 69.1 (Jan., 1962): 9-15.

[3] BLOG Inference Engine ( Memento from June 16, 2011 in the Internet Archive )