Information integration


Information integration refers to the merging of information from multiple databases (data sources), generally with different data structures, into a common, uniform data structure.

Above all, heterogeneous sources should be combined as completely and efficiently as possible into a structured unit that can be used more effectively than direct access to the individual sources would allow. Information integration is particularly necessary where several independently evolved systems are to be connected, for example when merging companies, work processes, and applications, or when searching for information on the Internet.

The integration of more complex systems did not become a focus of computer science research until the 1990s and is therefore still under development.

History

The rapid development in database technology since the 1960s created the need to share and combine existing data. This combination can take place at a variety of levels in the database structure. A popular solution is based on the principle of the data warehouse, which extracts data from heterogeneous sources, transforms it, and loads it into a standardized system.

Since 2009, the trend in information integration has moved toward standardized query interfaces that query the data in real time. This allows data to be queried directly from the heterogeneous sources, which is an advantage in terms of data currency but comes at the cost of increased access times. Since 2010, some research in this area has addressed the problem of semantic integration. This is concerned less with the architecture of different databases than with resolving semantic conflicts between heterogeneous data sources. For example, if two companies want to unify their databases, certain concepts and definitions, such as "revenue", may have different meanings. Approaches in this direction include the use of ontologies and benchmarking.

The data processing models in use since 2011 lead to data isolation in the form of islands of scattered data. These islands are an unwanted artifact of the data modeling methodology, which produces disparate data sets. To counteract this problem, methods have been developed that avoid data isolation artifacts and integrate the islands into the overall data structure.

Methods

The integration of heterogeneous information from different sources concerns both the integration of the actual data and of the structures (schemas) in which they are available. First, the local schemas usually have to be integrated (schema integration), for which (partially) automatic procedures can also be used (schema matching). For the subsequent integration of the data, methods of data fusion and duplicate detection are required.

Available technologies for integrating information include similarity analyses, which allow similar text in different sources to be detected via fuzzy string search.
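
As an illustration, the following minimal Python sketch detects likely duplicates across two sources using the standard library's difflib; the company names and the similarity threshold are invented for the example and not tied to any specific integration product.

```python
# A minimal sketch of fuzzy string matching for duplicate detection,
# using only Python's standard library. Names and threshold are
# illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_a = ["Miller GmbH", "ACME Corporation", "Smith & Sons Ltd."]
source_b = ["Mueller GmbH", "ACME Corp.", "Johnson AG"]

THRESHOLD = 0.65  # chosen arbitrarily for the example

for a in source_a:
    for b in source_b:
        score = similarity(a, b)
        if score >= THRESHOLD:
            print(f"possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")
```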

Opportunities and goals

Information integration is significant in a number of different situations, both commercial and scientific. Examples of practical applications include integrating product information from manufacturers' data so that product search engines can retrieve it, and evaluating various geological data sets to determine cross-border surface properties.

Where the data from different sources overlap (extensional redundancy), correspondences between records can in some cases be determined automatically and used to complete data sets (data fusion). For example, the entries in a telephone list and an employee directory can be combined when personal names match. Because more information about individual objects is gathered in this way, this is also referred to as compression.
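
A minimal sketch of this kind of fusion, assuming person names serve as the common key; all names and attribute values are invented:

```python
# Record-level data fusion across two sources that share person names as
# a common key. All data is illustrative.
phone_list = {
    "Alice Example": {"phone": "+49 30 1234567"},
    "Bob Sample": {"phone": "+49 30 7654321"},
}
employee_directory = {
    "Alice Example": {"department": "Sales", "office": "2.14"},
    "Carol Demo": {"department": "IT", "office": "3.01"},
}

# Fuse records whose keys match; attributes from both sources are merged,
# so each fused record becomes denser (more filled attributes per object).
fused = {}
for name in phone_list.keys() | employee_directory.keys():
    record = {}
    record.update(phone_list.get(name, {}))
    record.update(employee_directory.get(name, {}))
    fused[name] = record

print(fused["Alice Example"])
# {'phone': '+49 30 1234567', 'department': 'Sales', 'office': '2.14'}
```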

The aim of integration is to enable a consistent global view of all data sources. Redundant data sources can be used for verification. Combining intensionally redundant sources leads to higher coverage, and completing individual records across extensionally redundant sources leads to higher density.
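
As a small worked illustration, the following sketch computes both measures under one common pair of definitions (coverage as the fraction of relevant real-world objects represented, density as the average fraction of filled attributes per record); the figures are invented:

```python
# Coverage and density of an integrated data set, under the assumed
# definitions stated in the lead-in. All figures are invented.
ATTRIBUTES = ["phone", "department", "office"]
WORLD_SIZE = 4  # assumed number of relevant real-world objects

records = {
    "Alice Example": {"phone": "+49 30 1234567", "department": "Sales"},
    "Bob Sample": {"phone": "+49 30 7654321"},
}

coverage = len(records) / WORLD_SIZE  # 2 of 4 objects -> 0.50
density = sum(len(r) for r in records.values()) / (
    len(records) * len(ATTRIBUTES)
)  # 3 filled of 6 possible attribute values -> 0.50

print(f"coverage = {coverage:.2f}, density = {density:.2f}")
```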

Materialized vs. virtual integration

Basically, two types of integration can be distinguished:

  • Materialized or physical integration: Data from different data sources, usually with different data structures, are transformed into the target structure and copied into a central database, where they are then available for analysis. This principle can be found, for example, in data warehouses or in the data exchange project of the Open Archives Initiative.
  • Virtual or logical integration: The data remains in the various sources, and integration takes place only when a query is made (federated information system).
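
The following minimal sketch contrasts the two styles; the in-memory "sources" and all names stand in for real external systems and are purely illustrative:

```python
# Materialized vs. virtual integration, reduced to its essentials.
source_a = [{"id": 1, "name": "Alice"}]
source_b = [{"id": 2, "name": "Bob"}]

# Materialized: transform and copy the data into a central store ahead
# of time; queries then run against the copy.
central_store = []
for source in (source_a, source_b):
    for row in source:
        central_store.append({"id": row["id"], "name": row["name"].upper()})

def query_materialized(name: str):
    return [r for r in central_store if r["name"] == name.upper()]

# Virtual: leave the data in the sources and integrate only at query
# time, so results always reflect the sources' current state.
def query_virtual(name: str):
    results = []
    for source in (source_a, source_b):
        for row in source:
            if row["name"].upper() == name.upper():
                results.append({"id": row["id"], "name": row["name"].upper()})
    return results

print(query_materialized("alice"))  # reads the central copy
print(query_virtual("bob"))         # reads the sources directly
```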

In comparison, the two approaches have the following advantages and disadvantages:

  • Timeliness: With materialized integration, the timeliness of the data depends on the interval between updates from the sources; a virtually integrated system, on the other hand, is always up to date, since the data is integrated at the time of the query.
  • Response time: Since all data in a materialized system is held centrally, it can be stored in a manner optimized for fast response times. With virtual integration, the response time depends heavily on the availability of the source data management systems, the access speed to the source data, the transmission paths, and additional query-time tasks such as data transformation (mapping) and data cleansing.
  • Flexibility: As large data stores, materialized systems are usually harder to maintain than virtually integrated systems, in which maintenance of the data is the responsibility of the sources. In addition, adding a source can affect the entire integration (global-as-view), whereas with virtual integration adding, removing, or changing a source affects only its mapping to the global schema (local-as-view).
  • Autonomy of the data sources: Neither materialized nor virtual data integration directly affects the data sources; their structure, for example, remains unchanged. The necessary access can, however, change the demands placed on them, such as availability and performance. Virtual integration exerts the stronger influence here, since with physical integration access can be scheduled deliberately, for example at times of generally low load.
  • Hardware requirements: Materialized integration usually requires the procurement of dedicated hardware.
  • Data quality: With materialized integration, there is generally more time available to transform the data, so more complex analyses are possible than with virtual data integration; the achievable data quality is therefore higher.

Integration architectures

Materialized integration architectures

In materialized systems, data is imported from the sources, cleansed, and stored centrally. The data available in the source systems is usually not changed.

  • Data warehouses (DWH): These are the most important representatives of materialized database systems. The data required to meet a company's information needs is stored persistently in a central data warehouse in order to enable a global, uniform view of the relevant data. To integrate the source data into the data warehouse's base database, an integration layer must be implemented (ETL process; a minimal sketch follows this list).
  • Operational Data Stores (ODS): While data warehouse systems are primarily geared to the requirements of corporate management and their information serves strategic decision-making, operational data stores make the integrated data available to operational business processes. This implies that the centrally stored data is used "operationally", i.e. once integration (import, cleansing, storage) is complete, this data is subject to change. The focus of ODS systems is therefore not on historical but primarily on current data. This is another essential distinction from a DWH, since synchronization with the source data must take place either at query time or at least at frequent, regular intervals. ODS are mostly used in business areas where the timeliness of the data plays an essential role, for example in customer and supplier communication and in warehouse management processes. With the trend toward real-time data warehouses and more powerful database management systems, the operational data store is likely to merge with the data warehouse.
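
The ETL sketch referenced above: two heterogeneous sources (a CSV file and a JSON file) are extracted, mapped onto one target structure, and loaded into a central SQLite database. File names, schemas, and the field mapping are illustrative assumptions; the sample files are created first so the script is self-contained.

```python
import csv
import json
import sqlite3

# Set up the two example sources (invented data).
with open("customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "full_name"])
    writer.writerow(["1", "Alice Example"])
with open("customers.json", "w") as f:
    json.dump([{"id": 2, "name": "Bob Sample"}], f)

def extract():
    """Read raw records from the heterogeneous sources."""
    with open("customers.csv", newline="") as f:
        csv_rows = list(csv.DictReader(f))
    with open("customers.json") as f:
        json_rows = json.load(f)
    return csv_rows, json_rows

def transform(csv_rows, json_rows):
    """Map both source structures onto the common target structure."""
    unified = [(r["customer_id"], r["full_name"].strip()) for r in csv_rows]
    unified += [(str(r["id"]), r["name"].strip()) for r in json_rows]
    return unified

def load(rows):
    """Copy the transformed data into the central database."""
    con = sqlite3.connect("warehouse.db")
    con.execute("CREATE TABLE IF NOT EXISTS customer (id TEXT, name TEXT)")
    con.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(*extract()))
```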

Virtual integration architectures

In contrast to materialized systems, data in virtual database systems is not stored in the integrated system itself, but remains physically in the data sources and is only loaded into the integration system in response to queries (virtual data storage).

  • Federated database systems (FDBS): At the center of a federated database system is a global conceptual (canonical) schema. On the one hand, this schema forms the interface to the local, distributed databases and their local schemas; on the other, it offers requesting applications an integrated, global view of the federated source data by means of suitable services. FDBS mostly arise from combining several database systems (multi-database systems) with the aim of centrally (federated) coordinating common tasks.
  • Mediator-based information systems & wrappers (MBS): Mediators act as intermediaries between data sources and applications. A mediator receives queries from an application and answers them by communicating with the relevant data sources. This presupposes considerable knowledge of the structure of all federated data sources with regard to their schemas and possible inconsistencies of the connected entities. In contrast to federated database systems, however, mediator-based information systems offer only read access to the integrated systems. Mediator-based systems combined with wrappers represent a concrete software form of middleware (a minimal sketch follows this list). In principle, mediators can also be used as part of a materialized information system, e.g. as an intermediary between the integration layer (or the central data warehouse) and the source systems, in order to overcome the heterogeneity of the connected source systems. However, since mediator-based systems lack the essential characteristic of materialized systems, a central data store, they are assigned to the virtual integration architectures.
  • Peer data management systems (PDMS): The last integration system relevant in practice is the peer data management system. The internal structure of a peer component is defined as follows:
  1. Peers can manage one or more data stores of their own.
  2. Schema mappings exist between a peer's own data structures and those of other peers, through which data elements can be related to one another.
  3. Each peer provides an export schema or functions for communicating with connected components. Peers act as independent, autonomous components that try to answer queries both from their own data stores and with data or query results from other connected peers.
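
The wrapper-and-mediator sketch referenced above: each wrapper hides one heterogeneous source behind a common read-only interface, and the mediator answers application queries by consulting all wrappers and merging the results. All class and field names are illustrative assumptions.

```python
from abc import ABC, abstractmethod

class Wrapper(ABC):
    """Common read-only interface that every source wrapper implements."""
    @abstractmethod
    def find_by_name(self, name: str) -> list[dict]:
        ...

class CsvLikeWrapper(Wrapper):
    def __init__(self, rows: list[dict]):
        self.rows = rows  # stands in for a CSV-backed source

    def find_by_name(self, name: str) -> list[dict]:
        return [{"name": r["full_name"]} for r in self.rows
                if r["full_name"].lower() == name.lower()]

class KeyValueWrapper(Wrapper):
    def __init__(self, store: dict):
        self.store = store  # stands in for a key-value source

    def find_by_name(self, name: str) -> list[dict]:
        return [{"name": k} for k in self.store if k.lower() == name.lower()]

class Mediator:
    """Answers application queries by consulting the relevant wrappers."""
    def __init__(self, wrappers: list[Wrapper]):
        self.wrappers = wrappers

    def query(self, name: str) -> list[dict]:
        results = []
        for w in self.wrappers:
            results.extend(w.find_by_name(name))
        return results

mediator = Mediator([
    CsvLikeWrapper([{"full_name": "Alice Example"}]),
    KeyValueWrapper({"Alice Example": {"dept": "Sales"}}),
])
print(mediator.query("alice example"))
```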


Related topics

Information integration overlaps and interacts with a number of related subject areas, among them data integration and data fusion.
