Data warehouse

from Wikipedia, the free encyclopedia

A data warehouse ( DWH or DW for short ; literally "data warehouse") is a central database optimized for analysis purposes that brings together data from several, usually heterogeneous sources. The term comes from information management in business informatics .

term

In the mid-1980s, the term information warehouse was created at IBM . The term data warehouse was first used in 1988 by Barry Devlin .

There is currently no standard definition for the term “data warehouse”. However, the following largely applies:

  • A data warehouse enables a global view of heterogeneous and distributed data stocks by merging the data relevant to the global view from the data sources into a common, consistent database. This enables convenient data access.
  • The content of a data warehouse is created by copying and processing data from different sources.
  • A data warehouse is usually the basis for the aggregation of operational key figures and analyzes based on them, the so-called Online Analytical Processing (OLAP).
  • A data warehouse is often the starting point for data mining .
  • The applications often work with application-specific extracts from the data warehouse, the so-called data marts .

Differences in the definitions can mainly be found in the general purpose of a data warehouse and in the scope and handling of the data in the data warehouse.

  • The range of definitions begins with Bill Inmon's restrictive view :
    "A data warehouse is a topic-oriented, integrated, chronologized and persistent collection of data to support management in its decision-making processes."
    In the original: "A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process."
    The definition according to Inmon can be interpreted as follows:
    • subject-oriented: The selection of the data to be transferred to the data warehouse is made according to certain data objects (product, customer, company, ...) that are relevant for the analysis of key figures for decision-making processes, but not according to operational processes
    • integrated (standardization): The data, which are structured differently in different (operative) source systems, are stored in the data warehouse in a standardized form.
    • time-variant (time orientation): analyzes of changes and developments over time should be made possible in the data warehouse; therefore long-term storage of the data in the data warehouse is necessary (introduction of the "time" dimension).
    • nonvolatile (persistence): data is stored permanently (non-volatile).
  • The following definitions by Bauer and Günzel, as well as those by Kimball, are less restrictive, but are geared towards a special purpose, the analysis function:
    "A data warehouse is a physical database that represents an integrated view of (any) data to enable analysis."
    "A data warehouse is a copy of transaction data specifically structured for querying and reporting."
  • The range of definitions ends with the definition of toe, which is without restrictions on the scope and handling of the data and without a purpose:
    "A data warehouse is a physical database that enables an integrated view of the underlying data sources."

The "physical" restriction is necessary to separate the data warehouse from the "logical" federated database system .

Operation (data warehousing)

The creation of a data warehouse is based on two guiding principles:

  1. Integration of data from distributed and differently structured databases in order to enable a global view of the source data and thus comprehensive evaluations in the data warehouse.
  2. Separation of data that is used for operational business from data that is used in the data warehouse for tasks such as reporting, decision support, business analysis, controlling and corporate management.

The overall process of data acquisition, administration and evaluation of a data warehouse is also known as data warehousing . Data warehousing includes:

  • Data acquisition, data integration ( staging ) and further processing in the ETL process
  • Data storage, i.e. long-term storage of the data in the data warehouse (see also long-term archiving )
  • Data evaluation and analysis
  • Supply and data storage of the separate databases required for the analysis, the data marts .

In the data marts, the data is often stored as multi-dimensional matrices in the so-called star scheme or in related data schemes such as snowflake and galaxy schemes .

In recent years, there has been an increasing shift away from regular loading towards real-time loading of the data warehouse (real-time data warehousing). Some industries, such as the telecommunications industry and retail, had a need for data that was immediately available while maintaining the separation of operational and evaluation systems. Real-time data warehousing also enables the results to be played back directly into the operational systems. In this way, the results of analysis of data from the data warehouse influence the operational systems that feed the data warehouse (closed loop).

Applications

Excerpt from the data warehouse system
  • Integration of data from differently structured and distributed databases to enable a global view of the source data and thus comprehensive evaluations
  • Finding hidden relationships between data through data mining
  • Fast and flexible availability of reports, statistics and key figures, for example to identify relationships between the market and the range of services
  • Comprehensive information about business objects and relationships
  • Transparency over time on business processes, costs and use of resources
  • Provision of information, for example for the creation of product catalogs.

Risks

The 59th Conference of the Federal and State Data Protection Officers from 14./15. March 2000, in its resolution on data warehousing, data mining and data protection, points out the legal risk associated with these practices.

In particular, the basic right to informational self-determination and the protection of privacy is at risk.

The reason is the possibility of the above-mentioned procedures to store and use personal data beyond its purpose, which is illegal under certain circumstances. The recommendation is to rely on techniques that use an anonymized or pseudonymized form of the originally personal form of the data.

literature

  • Jan Holthuis: The structure of warehouse systems, concept, data modeling, procedure. Deutscher-Universitäts-Verlag, ISBN 3-8244-6959-6 .
  • Ralph Hughes: Agile Data Warehousing for the Enterprise . Morgan Kaufmann, Waltham, Massachusetts 2015, ISBN 978-0-12-396518-9 .
  • Hans Hultgren: Modeling the Agile Data Warehouse with Data Vault . Brighton Hamilton, Denver et al. 2012, ISBN 978-0-615-72308-2 .
  • William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse. John Wiley & Sons, ISBN 0-471-05966-8 .
  • Reinhard Jung, Robert Winter: Data Warehousing Strategy. Springer, ISBN 3-540-67308-3 .
  • H.-G. Kemper, W. Mehanna, C. Unger: Business Intelligence. Basics and practical application. Vieweg, ISBN 3-528-05802-1 .
  • Ralph Kimball , Margy Ross: The Data Warehouse Toolkit. The Definitive Guide to Dimensional Modeling . 3. Edition. Wiley, New York 2013, ISBN 978-1-118-53080-1 .
  • Wolfgang Lehner: Database technology for data warehouse systems. Concepts and Methods. dpunkt, ISBN 3-89864-177-5 .
  • Daniel Linstedt, Michael Olschimke: Building a Scalable Data Warehouse with Data Vault 2.0 . Morgan Kaufmann, Waltham, Massachusetts 2016, ISBN 978-0-12-802510-9 .
  • Dani Schnider, Claus Jordan and others: Data Warehouse Blueprints. Business intelligence in practice . Hanser, Munich 2016, ISBN 978-3-446-45075-2 .
  • Reinhard Schütte: Data Warehouse Management Manual. Concepts, software, experiences. Springer, ISBN 3-540-67561-2 .
  • J.-H. Wieken: The way to the data warehouse. Addison-Wesley, ISBN 978-3-8273-1560-1 .

Individual evidence

  1. Erhard Rahm: Data Warehouses. (PDF) Introduction. P. 2 , accessed on September 29, 2015 (lecture script, University of Leipzig).
  2. ^ Barry Devlin: Data Warehouse. From architecture to implementation . Addison-Wesley, ISBN 0-201-96425-2 .
  3. P. Mertens et al .: Fundamentals of Information Systems . 5th edition. Springer Verlag, Berlin 1998, p. 72 .
  4. William H. Inmon: Building the Data Warehouse. John Wiley & Sons, 1996, ISBN 978-0-471-14161-7 , page 33
  5. a b Andreas Bauer, Holger Günzel: Data Warehouse Systems: Architecture, Development, Application. dpunkt, 2013, ISBN 3-89864-785-4
  6. ^ A b Ralph Kimball, Mary Ross: The Data Warehouse Toolkit. The Complete Guide to Dimensional Modeling. Wiley, ISBN 0-471-20024-7
  7. Thomas Zeh: Data warehousing as an organizational concept of data management. A critical look at Inmon's data warehouse definition. In: Computer Science - Research and Development. 18, No. 1, 2003.