Data lake

A data lake, in business information systems, is a system or repository that stores data in its raw format, usually as blobs or files. A data lake typically serves as a single repository for all corporate data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. It can contain structured data from relational databases (rows and columns) or from CSV, XML, or JSON files, as well as unstructured data such as e-mails, documents, PDF files, and binary data (images, audio, memory dumps).

In addition to the data, generic analysis methods are also stored, so that they are available for the centrally stored data and do not have to be assembled before each analysis. Because a data lake keeps raw copies of the source data alongside transformed data, it typically requires much more storage capacity than a data warehouse. In return, unprocessed raw data is malleable, can be analyzed quickly for a wide variety of purposes, and is well suited to machine learning.

A data swamp is an unmanaged data lake that is either inaccessible to its intended users or provides little value. Data swamps arise when adequate data quality and data governance measures are not implemented.
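
The source does not say which governance measures are meant. As one illustrative assumption, a lake could keep a catalog with an owner and a description for every stored object and regularly flag objects that lack them. The sketch below shows such a check; `CatalogEntry`, `ungoverned`, and all paths are hypothetical names chosen for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one object in the lake."""
    path: str
    owner: str = ""
    description: str = ""


def ungoverned(object_paths, catalog):
    """Return paths that have no catalog entry or lack an owner or a description."""
    by_path = {entry.path: entry for entry in catalog}
    flagged = []
    for path in object_paths:
        entry = by_path.get(path)
        if entry is None or not entry.owner or not entry.description:
            flagged.append(path)
    return flagged


# Made-up lake contents and catalog for illustration.
objects = ["raw/orders.csv", "raw/export_final_v2.bin", "raw/clicks.json"]
catalog = [
    CatalogEntry("raw/orders.csv", owner="sales", description="Daily order export"),
    CatalogEntry("raw/clicks.json", owner="web"),  # description missing
]

flagged = ungoverned(objects, catalog)
print(flagged)
```

Objects that end up on such a list are candidates for becoming swamp content: nobody is responsible for them, or nobody can tell what they contain.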

Differences from the data warehouse

In the data warehouse concept, selected data is extracted from the source systems, transformed, and loaded into the central data warehouse, following the ETL (extract, transform, load) scheme.
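
The warehouse approach can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: the CSV export, its column names, and the `orders` table are assumptions made up for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system; the column names are invented.
SOURCE_CSV = """order_id,customer,amount,internal_note
1,Alice,19.99,call back
2,Bob,not-a-number,
3,Carol,5.00,vip
"""


def extract(text):
    """Extract: read all rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only the modeled columns and reject rows that do not fit."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # rows that fail validation never reach the warehouse
        cleaned.append((int(row["order_id"]), row["customer"], amount))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Only two of the three source rows arrive in the table, and the `internal_note` column is dropped entirely. The schema is fixed before loading, which is what makes the stored data directly queryable.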

In contrast, in the data lake concept, all data is loaded from the source systems. No data is rejected. The data is stored in an untransformed or almost untransformed state and is processed only when it is needed for data visualization or data analysis. The advantage of the data warehouse, namely structured data that is ready for evaluation, is given up in favor of greater flexibility. Accordingly, users need to know more about the structure of the data.
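
The lake approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: a temporary directory stands in for the lake's storage, and the file names, the `amount` field, and the helpers `ingest` and `read_amounts` are hypothetical. Structure is applied only when the data is read, a pattern often called schema-on-read.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def ingest(lake_dir, name, payload):
    """Store the payload byte for byte; nothing is parsed, filtered, or rejected."""
    path = Path(lake_dir) / name
    path.write_bytes(payload)
    return path


def read_amounts(lake_dir):
    """Apply structure only at analysis time (schema-on-read)."""
    amounts = []
    for path in sorted(Path(lake_dir).iterdir()):
        if path.suffix == ".csv":
            records = csv.DictReader(io.StringIO(path.read_text(encoding="utf-8")))
        elif path.suffix == ".json":
            records = json.loads(path.read_text(encoding="utf-8"))
        else:
            continue  # formats this analysis does not understand stay untouched
        for record in records:
            try:
                amounts.append(float(record["amount"]))
            except (KeyError, ValueError, TypeError):
                pass  # the reader, not the loader, has to cope with irregular records
    return amounts


with tempfile.TemporaryDirectory() as lake:
    ingest(lake, "orders.csv", b"order_id,amount\n1,19.99\n2,not-a-number\n")
    ingest(lake, "orders.json", b'[{"order_id": 3, "amount": 5.0}]')
    ingest(lake, "scan.bin", b"\x00\x01\x02")
    stored = sorted(p.name for p in Path(lake).iterdir())
    amounts = read_amounts(lake)

print(stored)
print(amounts)
```

All three files are stored, including the malformed row and the binary file. The cost appears on the reading side: `read_amounts` has to know each format and decide what to do with records that do not fit, which is the increased need for knowledge of the data structure mentioned above.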
