ETL process


Extract, Transform, Load (ETL) is a process in which data from several, possibly differently structured, data sources is combined in a target database.

Figure: ETL data flow and supporting programs
  • Extraction of the relevant data from various sources
  • Transformation of the data into the schema and format of the target database
  • Loading of the data into the target database

The process is best known from its use in operating a data warehouse, where large amounts of data from several operational databases have to be consolidated in order to be stored in the data warehouse.

Functionality

As a general process of information integration, the procedure can also be applied to other databases. The key task is to bring together heterogeneously structured data from different sources. The process must run efficiently in order to minimize blocking times at the sources, and it must ensure the quality of the data so that it can be kept complete and consistent in the data warehouse despite possible changes to the sources.

Newer application areas of data warehouses require data to be added at an accelerated pace. The focus of ETL is therefore increasingly on minimizing the latency until data from the source systems becomes available. This requires the process to be carried out more frequently.

In general, a repository is involved in all steps; in particular, it holds the necessary data cleansing and transformation rules as well as the schema data as metadata, and retains them in the long term.

Most ETL tool suites include routines for data profiling. When migrating from legacy systems, the data quality of the source systems is often not known in advance; data profiling measures it. The mapping rules of the transformation must be aligned with these findings so that the target system works after the load.
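
As an illustration of what data profiling measures, the following Python sketch (hypothetical record layout and column names) computes per-column null counts, distinct-value counts, and minimum/maximum values over a list of source records.

    from collections import defaultdict

    def profile(records):
        """Collect simple per-column metrics: null count, number of
        distinct values, and min/max of the non-null values."""
        stats = defaultdict(lambda: {"nulls": 0, "values": set()})
        for row in records:
            for column, value in row.items():
                if value is None:
                    stats[column]["nulls"] += 1
                else:
                    stats[column]["values"].add(value)
        report = {}
        for column, s in stats.items():
            values = s["values"]
            report[column] = {
                "nulls": s["nulls"],
                "distinct": len(values),
                "min": min(values) if values else None,
                "max": max(values) if values else None,
            }
        return report

    # Hypothetical records extracted from a legacy source system
    rows = [
        {"customer_id": 1, "country": "DE", "revenue": 120.0},
        {"customer_id": 2, "country": None, "revenue": 80.5},
        {"customer_id": 3, "country": "de", "revenue": None},
    ]
    print(profile(rows))

Findings such as the inconsistent country codes "DE" and "de" above feed directly into the mapping rules of the transformation step.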

Extraction

During extraction, a subset of the data is usually extracted from the sources and made available for the transformation. The sources can consist of different information systems with different data formats and structures. A schema transformation takes place from the schema of the source data to the schema of the work area.

In order to supply the data warehouse with current data, extraction must take place regularly. It can be carried out synchronously or asynchronously with respect to the sources. With synchronous extraction, every change to the source system is immediately propagated to the data warehouse. This approach enables real-time data warehousing, which covers the need for immediately available data while maintaining the separation of operational and analytical systems. Asynchronous extraction can take place periodically, event-driven, or request-driven:

Periodic
The source generates extracts of its data at regular intervals, which are queried regularly.
Event-driven
The source generates an extract for certain events, for example after a certain number of changes.
Request-driven
The source provides extracts only on request.

Note that access to the source systems should take place only during their "idle time", i.e. after post-processing. The extracts from the sources can be full or partial snapshots, or parts of log files in which all changes since the last snapshot are listed.
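
A minimal sketch of request-driven delta extraction in Python, assuming a relational source with a hypothetical orders table that carries a last_modified timestamp; the ETL job remembers the timestamp of its previous run and pulls only the rows changed since then.

    import sqlite3

    def extract_delta(source_db, since):
        """Request-driven extraction: pull only the rows changed since
        the previous extraction run, based on a last_modified column."""
        con = sqlite3.connect(source_db)
        con.row_factory = sqlite3.Row
        try:
            cur = con.execute(
                "SELECT order_id, amount, last_modified "
                "FROM orders WHERE last_modified > ?",
                (since,),
            )
            return [dict(row) for row in cur.fetchall()]
        finally:
            con.close()

    # Timestamp of the previous run, kept by the ETL job itself
    changes = extract_delta("source.db", "2024-01-01T00:00:00")

A full snapshot would simply omit the WHERE clause; reading change log files instead of querying the source is one alternative when the source must not be queried at all.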

Transformation

The data coming from the differently structured sources, which can be based on different value ranges, must be transformed into a uniform data schema. The transformation essentially consists of adapting the data to the specified target structures (schema mapping). The usually time-consuming data cleansing also falls under the transformation. The transformation takes place in a separate work area (staging area).

Typical transformations and transformation steps can be divided into two areas (a brief sketch follows the list below):

Syntactic transformations
These improve, convert, or correct the data with respect to formal aspects. The data is modified to match the syntax required and used in the target system. An example is the adaptation of data types (e.g. a numerical representation of the date, YYYYMMDD, to the standardized date format of the date data type).
Semantic transformations
The data is checked for content-related aspects and, if necessary, modified and enriched. This includes, for example:
  • Elimination of duplicates (object identification),
  • Key adaptation (e.g. different country codes to DIN ISO country codes),
  • Adaptation of data values (e.g. different encodings of gender, such as 1 (female) and 2 (male), to f (female) and m (male)),
  • Conversion of units of measurement (e.g. different volumes such as gallons and hectoliters to liters),
  • Aggregation (e.g. individual sales of a sales product to monthly sales per sales product),
  • Enrichment of the data read from the source systems with additional information, such as externally procured demographic data or unique company identifiers like the D&B number (the so-called DUNS number), and any other data whose combination with the data from one's own systems can enhance the information content of the processed data.
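
The sketch below (hypothetical record layout and code tables) combines a syntactic transformation, converting a YYYYMMDD string into a date value, with two semantic transformations, mapping gender codes and converting gallons to liters.

    from datetime import datetime

    # Hypothetical code tables for the semantic transformations
    GENDER_CODES = {"1": "f", "2": "m"}   # source coding -> target coding
    GALLONS_TO_LITERS = 3.785411784       # US gallons

    def transform(record):
        """Apply syntactic and semantic transformations to one record."""
        return {
            # syntactic: numeric YYYYMMDD string -> date value
            "order_date": datetime.strptime(record["order_date"], "%Y%m%d").date(),
            # semantic: adapt coded values to the target coding
            "gender": GENDER_CODES.get(record["gender"], "unknown"),
            # semantic: convert units of measurement to liters
            "volume_l": round(float(record["volume_gal"]) * GALLONS_TO_LITERS, 3),
        }

    source_row = {"order_date": "20240131", "gender": "2", "volume_gal": "1.5"}
    print(transform(source_row))
    # {'order_date': datetime.date(2024, 1, 31), 'gender': 'm', 'volume_l': 5.678}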

Load

When loading, the data from the work area must be brought into the data warehouse. As a rule, this should be done as efficiently as possible, so that the database is not blocked at all, or only briefly, during loading, and so that its integrity is preserved. In addition, a version history can be created in which changes are logged, so that data that was valid at earlier points in time can still be accessed (see Slowly Changing Dimensions).

To integrate the data into the data warehouse, a further schema transformation from the schema of the work area to the schema of the data warehouse is necessary.
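
As a minimal sketch of the load step, assuming a SQLite target with a hypothetical sales table: the transformed rows are written with a bulk insert inside a single transaction, so the target is blocked only briefly and either all rows are committed or none.

    import sqlite3

    def load(target_db, rows):
        """Load transformed rows into the target table in one short transaction."""
        con = sqlite3.connect(target_db)
        try:
            with con:  # opens a transaction; commits on success, rolls back on error
                con.executemany(
                    "INSERT INTO sales (product, month, revenue) VALUES (?, ?, ?)",
                    rows,
                )
        finally:
            con.close()

    load("warehouse.db", [("widget", "2024-01", 1200.0), ("gadget", "2024-01", 830.5)])

A version history in the sense of slowly changing dimensions would additionally close the validity period of the previous row version before inserting the new one.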

Tools / manufacturers

Although ETL processes can be implemented with custom-developed programs, the following reasons speak in favor of using standard tools:

  • Every standard tool supports access to common database systems as well as ERP and file systems.
  • The development is supported by suitable transformations, methods, and procedures (such as visualization of the data flow, error handling, and scheduling).
  • The requirements for high-performance loading are usually already implemented in the standard tool; precise knowledge of the mechanisms of the target systems is usually not necessary.
  • Developing and maintaining ETL processes with the visual facilities of standard tools is usually easier and cheaper than with hand-written programs in a programming language.

Leading manufacturers of data integration software include SAS Institute, IBM (Information Server), Informatica (PowerCenter), SAP Business Objects (BusinessObjects Data Integrator), SAP Data Services, Altova (MapForce), Oracle (Oracle Warehouse Builder, Oracle Data Integrator), and Microsoft (SQL Server Integration Services). Another provider is Comit with the Data Management Suite (DMS).

The best-known tools in the open-source environment are Kettle Pentaho Data Integration, Scriptella ETL, CloverETL, Talend Open Studio, and the Perl framework Catmandu, which originates from the library community.

Literature

References

  1. LibreCat is an open consortium, initially of three university libraries, working on Catmandu.