Data lineage

Data lineage or data origin (also data provenance or data pedigree, sometimes data ancestry or data family tree) refers to the problem of determining, for given aggregated data records in a data warehouse system, the original data records from which they were derived.

In a data warehouse system, data is usually extracted from various sources, transformed according to certain rules, and made available for analysis (see ETL process). Data lineage requires describing the opposite route, from the analysis results back to the sources. For this purpose, the transformations are modeled mathematically so that the input values associated with given output values of a transformation can be determined (see also the input-process-output model, known in German as the EVA principle).

Transformations

All processing steps are modeled as transformations $T$ that take an input $E$ and produce an output $A$: $T(E) = A$. The lineage $T'$ of an output record $a$ is defined as the subset $E'$ of the input that was involved in the construction of $a$: $E' = T'(a, E)$. The lineage of a set of records is the union of the lineages of its elements.

All transformations can be divided into three classes. It is assumed that the transformations are stable and deterministic, that is, no spurious output objects are invented and the same input always produces the same output.

Black box

A black box is a transformation for which no special properties can be specified. Any element of the output can depend on any element of the input. An example of a black box is a function that returns, for each number in a set, its deviation from the mean value. The data lineage of every output element is therefore the entire input:

$T'(a, E) = E$
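The deviation-from-mean example can be sketched in Python as follows. This is only an illustration: the function names `deviation_from_mean` and `trace_black_box` are made up for this sketch and do not come from the literature.

```python
from statistics import mean

def deviation_from_mean(values):
    """Black-box transformation: every output depends on every input,
    because the mean is computed over the whole input."""
    m = mean(values)
    return [v - m for v in values]

def trace_black_box(output_element, inputs):
    """Lineage of any output element of a black box: the entire input."""
    return list(inputs)

inputs = [2, 4, 9]                      # the mean is 5
outputs = deviation_from_mean(inputs)   # one deviation per input value
lineage = trace_black_box(outputs[0], inputs)
```

Changing any single input value shifts the mean and therefore every output, which is why no smaller lineage can be given.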

Dispatcher

A dispatcher is a transformation that handles elements of the input independently of one another. Each input element can generate any number of output elements (including zero). The lineage of an output element $a$ of a dispatcher consists of all input elements $e$ that were involved in producing $a$, that is, all $e$ for which $a$ is among the outputs generated by $e$ alone:

$T'(a, E) = \{ e \in E \mid a \in T(\{e\}) \}$
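A minimal sketch of this formula in Python, assuming the transformation can be re-applied to a single input element. The dispatcher `split_words` and the helper `trace_dispatcher` are hypothetical names chosen for the example.

```python
def split_words(lines):
    """Hypothetical dispatcher: each line independently yields its words
    (an empty line yields no output at all)."""
    out = []
    for line in lines:
        out.extend(line.split())
    return out

def trace_dispatcher(transform, output_element, inputs):
    """Keep every input element e whose individual output contains
    the traced output element."""
    return [e for e in inputs if output_element in transform([e])]

inputs = ["red green", "", "green blue"]
lineage = trace_dispatcher(split_words, "green", inputs)
```

Here `"green"` is traced back to the two lines that contain it, while the empty line contributes to no output and never appears in any lineage.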

A special case of a dispatcher is a filter. In a filter, each input element produces either itself or no output at all. The lineage of an output element of a filter is therefore exactly that element:

$T'(a, E) = \{a\}$
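For a filter, tracing needs neither the input nor a re-run of the transformation. The short Python sketch below illustrates this with a hypothetical filter `keep_positive`; both function names are assumptions made for the example.

```python
def keep_positive(values):
    """Hypothetical filter: each value either passes unchanged or is dropped."""
    return [v for v in values if v > 0]

def trace_filter(output_element):
    """Lineage of a filter output is the output element itself."""
    return [output_element]

inputs = [-2, 5, 0, 7]
outputs = keep_positive(inputs)
lineage = trace_filter(outputs[0])
```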

Aggregator

An aggregator is a transformation in which each input element is involved in at least one output element and the input can be divided into disjoint partitions in such a way that each partition is responsible for exactly one output element. Each element of the output can be uniquely assigned to a group of input elements. The lineage of a given output element $a_k$ corresponds to its input partition $E_k$:

$T'(a_k, E) = E_k$

A special case of aggregators is the key-preserving aggregator, in which input elements generate the same output element only if they share the same key attribute value, and that key value also occurs in the output element.
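As an illustration of a key-preserving aggregator, the following Python sketch sums amounts per customer and traces an output row back through its key. The data layout `(customer, amount)` and the names `total_by_customer` and `trace_key_preserving` are assumptions made for this example.

```python
from collections import defaultdict

def total_by_customer(orders):
    """Hypothetical key-preserving aggregator:
    (customer, amount) rows -> one (customer, total) row per customer."""
    totals = defaultdict(int)
    for customer, amount in orders:
        totals[customer] += amount
    return sorted(totals.items())

def trace_key_preserving(output_element, inputs, input_key, output_key):
    """Lineage = all input elements whose key equals the output element's key."""
    key = output_key(output_element)
    return [e for e in inputs if input_key(e) == key]

orders = [("ann", 10), ("bob", 5), ("ann", 7)]
outputs = total_by_customer(orders)
lineage = trace_key_preserving(
    outputs[0], orders,
    input_key=lambda e: e[0],   # key attribute of an input row
    output_key=lambda a: a[0],  # the same key carried in the output row
)
```

Because the key survives in the output, the partition that produced an output element can be found with a single pass over the input, without re-running the aggregation.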

Another class of aggregators is the context-free aggregator, in which the assignment of an input element to a particular partition is independent of the values of the other input elements.

A transformation that maps every input object to itself (identity) or subjects each input element to a simple calculation (e.g. format conversion) is both a dispatcher and an aggregator and is also known as a filter.

Calculation of the data lineage

The data lineage of a given output can be determined using a tracing procedure if the properties of the transformation are known.

For dispatchers, each element of the input is checked to see whether it generates the output element and, if so, is added to the data lineage.
For context-free aggregators, the partitions are built first and then the one that produces the output element is selected. The partitions are determined by adding the input elements one at a time to an existing partition whenever the enlarged partition still produces a single output element; otherwise the element starts a new partition.
For key-preserving aggregators, the keys of the input elements are compared with the key of the output element.
For filters, the data lineage corresponds to the output element itself.
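The partition-building step for context-free aggregators can be sketched in Python as follows. This is a simplified reading of the procedure described above, not a reference implementation; the names `trace_context_free` and `total_by_customer` are hypothetical, and the transformation is assumed to be re-runnable on arbitrary subsets of the input.

```python
def total_by_customer(orders):
    """Hypothetical context-free aggregator: sums amounts per customer."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    return sorted(totals.items())

def trace_context_free(transform, output_element, inputs):
    """Build the partitions incrementally, then return the partition
    whose output is the traced element."""
    partitions = []
    for e in inputs:
        for p in partitions:
            # e belongs to p if p plus e still yields a single output element
            if len(transform(p + [e])) == 1:
                p.append(e)
                break
        else:
            # e fits no existing partition, so it starts a new one
            partitions.append([e])
    for p in partitions:
        if transform(p) == [output_element]:
            return p
    return []

orders = [("ann", 10), ("bob", 5), ("ann", 7)]
lineage = trace_context_free(total_by_customer, ("ann", 17), orders)
```

Unlike the key-based variant, this procedure needs no knowledge of a key attribute, at the price of re-running the transformation once per candidate partition.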

For general aggregators or black boxes, tracing is too expensive, since power sets of the input elements would have to be formed. To determine the data lineage of such a transformation efficiently, either an explicit tracing procedure must be known or an inverse function must be used. The inverse function of a transformation can be used as a tracing procedure only for aggregators, since it is not necessarily unique.

To determine the data lineage for a whole chain of transformations without having to store all intermediate results, the transformations are normalized: some of them are combined without losing their special properties (aggregator, dispatcher, filter, ...), so that efficient tracing remains possible. The optimal order for tracing a sequence of chained transformations also depends on the cost model used.
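For contrast with the normalization approach, the following Python sketch traces through a chain the naive way: it stores the input of every step and then walks backwards, applying each step's tracing procedure. All names (`trace_chain`, `keep_positive`, `total_by_customer`) and the representation of a step as a `(transform, tracer)` pair are assumptions made for this illustration.

```python
def keep_positive(orders):
    """Hypothetical filter step: drop rows with non-positive amounts."""
    return [o for o in orders if o[1] > 0]

def total_by_customer(orders):
    """Hypothetical key-preserving aggregator step."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    return sorted(totals.items())

def trace_chain(steps, inputs, output_element):
    """steps is a list of (transform, tracer) pairs, where tracer(a, E)
    returns the lineage of output element a within step input E."""
    # Forward pass: remember the input of every step (the intermediate results).
    stage_inputs, data = [], inputs
    for transform, _ in steps:
        stage_inputs.append(data)
        data = transform(data)
    # Backward pass: lineage of a set is the union of its elements' lineages.
    targets = [output_element]
    for (_, tracer), stage_input in zip(reversed(steps), reversed(stage_inputs)):
        traced = []
        for a in targets:
            for e in tracer(a, stage_input):
                if e not in traced:
                    traced.append(e)
        targets = traced
    return targets

steps = [
    (keep_positive, lambda a, E: [a]),                           # filter tracer
    (total_by_customer, lambda a, E: [e for e in E if e[0] == a[0]]),  # key tracer
]
orders = [("ann", 10), ("bob", -5), ("ann", 7), ("bob", 3)]
lineage = trace_chain(steps, orders, ("ann", 17))
```

The stored `stage_inputs` are exactly the intermediate results that normalization aims to avoid keeping.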

Literature

Yingwei Cui, Jennifer Widom. Lineage Tracing for General Data Warehouse Transformations. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB'01). 2001.