Data fusion

from Wikipedia, the free encyclopedia

Data fusion refers to the consolidation and completion of incomplete data records. It is an important part of information integration. With the help of a donor data set, data are added to a recipient data set. The donor data set consists of the variables (G, S) and the recipient data set consists of the variables (G, E). The variables G are therefore present in both data sets, while the variables S and E are each present in only one of them. A model for calculating the values of S from the variables G is created on the basis of the donor data set. This model is applied to the recipient data set, creating a new, merged data set: (G, E, S). The statistical methods used are summarized under the term statistical matching and are partly related to the methods for imputing missing values.

Examples

Data fusion in geostatistics

In geostatistics, the problem often arises that data are available for different locations and then have to be merged:

Donor record

place  X   Y   S1  S2
α1     10  10  a   c
α2     10  30  a   d
α3     30  10  b   c
α4     30  30  b   d

Recipient record

place  X   Y   E1  E2
β1     15  15  e   G
β2     15  35  e   H
β3     35  15  f   G

Merged data set

place  X   Y   E1  E2  S1  S2
β1     15  15  e   G   a   c
β2     15  35  e   H   a   d
β3     35  15  f   G   a   c
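The merging step can be sketched in a few lines of Python. The source does not say which model produced its merged table, so the nearest-donor rule used here (each recipient takes S1 and S2 from the donor closest in the X-Y plane) is an assumption for illustration only; the function name `fuse` and the dictionary layout are likewise hypothetical.

```python
from math import dist

# Donor records from the example: place -> (X, Y, S1, S2)
donors = {
    "α1": (10, 10, "a", "c"),
    "α2": (10, 30, "a", "d"),
    "α3": (30, 10, "b", "c"),
    "α4": (30, 30, "b", "d"),
}

# Recipient records from the example: place -> (X, Y, E1, E2)
recipients = {
    "β1": (15, 15, "e", "G"),
    "β2": (15, 35, "e", "H"),
    "β3": (35, 15, "f", "G"),
}


def fuse(recipients, donors):
    """Append S1, S2 of the nearest donor (assumed model) to each recipient."""
    merged = {}
    for place, (x, y, e1, e2) in recipients.items():
        # Pick the donor whose (X, Y) is closest in Euclidean distance.
        nearest = min(donors.values(), key=lambda d: dist((x, y), d[:2]))
        merged[place] = (x, y, e1, e2, nearest[2], nearest[3])
    return merged


merged = fuse(recipients, donors)
for place, row in merged.items():
    print(place, *row)
```

Any other model that predicts S from the common variables (regression, kriging, a classifier) could be substituted for the `min(...)` lookup without changing the overall structure.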

The result could also be a fully merged dataset:

place  X   Y   E1  E2  S1  S2
α1     10  10  ?   ?   a   c
α2     10  30  ?   ?   a   d
β1     15  15  e   G   ?   ?
β2     15  35  e   H   ?   ?
α3     30  10  ?   ?   b   c
α4     30  30  ?   ?   b   d
β3     35  15  f   G   ?   ?

The missing values, marked with ?, would have to be determined in one or more data fusion steps.
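Before any value is estimated, the fully merged table itself can be built mechanically. The following is a minimal sketch, assuming each record is a Python dictionary and that a missing value is represented by `None`; the names `COLUMNS` and `full_merge` are hypothetical.

```python
# Column order of the fully merged table.
COLUMNS = ("place", "X", "Y", "E1", "E2", "S1", "S2")

donors = [
    {"place": "α1", "X": 10, "Y": 10, "S1": "a", "S2": "c"},
    {"place": "α2", "X": 10, "Y": 30, "S1": "a", "S2": "d"},
    {"place": "α3", "X": 30, "Y": 10, "S1": "b", "S2": "c"},
    {"place": "α4", "X": 30, "Y": 30, "S1": "b", "S2": "d"},
]
recipients = [
    {"place": "β1", "X": 15, "Y": 15, "E1": "e", "E2": "G"},
    {"place": "β2", "X": 15, "Y": 35, "E1": "e", "E2": "H"},
    {"place": "β3", "X": 35, "Y": 15, "E1": "f", "E2": "G"},
]


def full_merge(*sources):
    """Bring all records into one schema; absent attributes become None."""
    rows = [
        {col: record.get(col) for col in COLUMNS}
        for source in sources
        for record in source
    ]
    # Order by location (X, then Y), as in the table above.
    return sorted(rows, key=lambda r: (r["X"], r["Y"]))


table = full_merge(donors, recipients)
for row in table:
    # Print missing values as "?", matching the notation of the table.
    print(*("?" if row[c] is None else row[c] for c in COLUMNS))
```

Each `None` in the result marks a value that a later fusion step would have to estimate.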

Data fusion in computer science

Whereas in duplicate detection the data records are largely complete and show only small deviations, data fusion has to combine several partially incomplete data records.

Before data from two sources can be merged, they may have to be brought into a common schema (schema integration). Attributes that do not exist in a source are filled with NULL (for "no value"). As a rule, a common identifying attribute is also required as an identifier; this can, for example, have been determined beforehand by means of duplicate detection.

Subsumption with the MINIMUM UNION operator

A simple method of data fusion is to merge a record into another one if it has more missing attributes and matches the other record in all attributes that are available (MINIMUM UNION). The record with more missing attributes is subsumed by the more complete record. In the following example, under MINIMUM UNION the first record is subsumed by the second:

Heinrich Müller from Berlin, age unknown
Heinrich Müller from Berlin, 55 years
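The subsumption rule can be sketched as follows. This is an illustrative reading of the definition above, not a reference implementation: it assumes that all records share the same schema, that `None` stands for NULL, and the function names `subsumes` and `minimum_union` are hypothetical.

```python
def subsumes(a, b):
    """True if record a subsumes record b.

    a must agree with b on every value that b actually has,
    and a must have strictly fewer missing (None) values than b.
    """
    agrees = all(a[k] == v for k, v in b.items() if v is not None)
    missing_a = sum(v is None for v in a.values())
    missing_b = sum(v is None for v in b.values())
    return agrees and missing_a < missing_b


def minimum_union(records):
    """Drop every record that is subsumed by another record."""
    return [
        r for r in records
        if not any(subsumes(other, r) for other in records if other is not r)
    ]


records = [
    {"name": "Heinrich Müller", "city": "Berlin", "age": None},
    {"name": "Heinrich Müller", "city": "Berlin", "age": 55},
]
result = minimum_union(records)
print(result)
# [{'name': 'Heinrich Müller', 'city': 'Berlin', 'age': 55}]
```

Two records that each hold a value the other lacks do not subsume one another under this rule, so both would be kept.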

Merging with the MERGE operator

The MERGE operator can also merge incomplete data records whose gaps complement each other, that is, where each record supplies a value that the other lacks. For example, a MERGE of the first two of the following records yields the third:

Heinrich Müller from Berlin, age unknown
Heinrich Müller from ??? , 55 years
Heinrich Müller from Berlin, 55 years

The MERGE operator can be expressed in SQL using the COALESCE function, which returns the first non-NULL value in a given list of arguments.
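A minimal sketch of this idea, run through Python's built-in `sqlite3` module: the table names `source_a` and `source_b`, their columns, and the use of the name as the identifying attribute are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_a (name TEXT PRIMARY KEY, city TEXT, age INTEGER);
    CREATE TABLE source_b (name TEXT PRIMARY KEY, city TEXT, age INTEGER);
    INSERT INTO source_a VALUES ('Heinrich Müller', 'Berlin', NULL);
    INSERT INTO source_b VALUES ('Heinrich Müller', NULL, 55);
""")

# For each attribute, take the first non-NULL value of the two sources.
merged = conn.execute("""
    SELECT a.name,
           COALESCE(a.city, b.city) AS city,
           COALESCE(a.age,  b.age)  AS age
    FROM source_a AS a
    JOIN source_b AS b ON a.name = b.name
""").fetchall()

print(merged)
# [('Heinrich Müller', 'Berlin', 55)]
conn.close()
```

The inner join used here keeps only records that occur in both sources; retaining unmatched records as well would require an outer join. When both sources hold different non-NULL values, COALESCE silently prefers the first argument, which is a form of conflict resolution by source preference.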

Conflict resolution

If individual attribute values are not only missing from related data records but also differ from one another, these differences are referred to as data conflicts. Data conflicts can result from typing errors, different spellings and encodings, errors in calculations and automatic text recognition, and outdated data. To resolve data conflicts by means of aggregation, preferences or other conflict resolution functions must be specified (for example, the average of differing numerical values). The data records are first grouped into sets of duplicates (see duplicate detection) and then aggregated within each group.

Examples of data conflicts between duplicates:

Heinrich Müller from Bärlin, 55 years
Heinrich Müller from Berlin, age 54
Heinrich Mueller from Bärlin, 55 years
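The grouping and aggregation steps can be sketched as follows. Everything beyond the three records themselves is an assumption for illustration: the `cluster` id is taken to come from a prior duplicate-detection step, `KNOWN_CITIES` is a hypothetical reference list used as a preference, and the choice of resolution function per attribute is just one possibility.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical reference list of valid city names (a "preference").
KNOWN_CITIES = {"Berlin"}

# The cluster id is assumed to come from a prior duplicate-detection step.
records = [
    {"cluster": 1, "name": "Heinrich Müller", "city": "Bärlin", "age": 55},
    {"cluster": 1, "name": "Heinrich Müller", "city": "Berlin", "age": 54},
    {"cluster": 1, "name": "Heinrich Mueller", "city": "Bärlin", "age": 55},
]


def most_common(values):
    """Return the most frequent value."""
    return Counter(values).most_common(1)[0][0]


def prefer_known_city(values):
    """Prefer values from the reference list; otherwise take the majority."""
    known = [v for v in values if v in KNOWN_CITIES]
    return most_common(known or values)


# One conflict resolution function per attribute.
RESOLVERS = {
    "name": most_common,
    "city": prefer_known_city,
    "age": lambda values: round(mean(values)),
}


def resolve(records):
    """Group records by duplicate cluster, then aggregate each attribute."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["cluster"]].append(record)
    fused = []
    for members in clusters.values():
        row = {}
        for attr, resolver in RESOLVERS.items():
            values = [m[attr] for m in members if m[attr] is not None]
            row[attr] = resolver(values) if values else None
        fused.append(row)
    return fused


fused = resolve(records)
print(fused)
# [{'name': 'Heinrich Müller', 'city': 'Berlin', 'age': 55}]
```

A plain majority vote on the city would have returned the misspelled "Bärlin", which is why the sketch uses a preference for that attribute instead.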

See also