Temporal data storage

from Wikipedia, the free encyclopedia

In information technology, temporal data storage (also called historization) means recording how data stored in a database evolves over time.

It is often sufficient to store only the currently valid value in a database; when a value changes, the old value is simply overwritten. If, however, there is a requirement to document all changes, temporal data storage is required. It makes it possible to reconstruct which value was valid at which point in time or, in less frequent cases, which value will only become valid in the future.

Two types of temporal consideration are relevant for temporal data storage:

  • Validity time (valid time): The period of time during which a data item is valid in the real world.
    Example: An item priced at €1.95 increases to €2.25 on June 1, 2006.
  • Transaction time (also processing time): The time at which a data element was saved in the database.
    Example: The above price adjustment of the article was entered into the database on May 25, 2006.

In some cases both types are relevant at the same time; the term "bitemporal" is used for this. It applies, for example, to the following question relating to the examples above: what price was quoted to a customer on May 20, 2006 for a purchase of the item that was not to take place until June 15, 2006?
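The bitemporal question can be illustrated with a small sketch (illustrative Python; the record structure and the function name are hypothetical, the data follows the article's example): each record carries both a validity interval and a transaction-time interval, and a query fixes one point on each time axis.

```python
from datetime import date

MAX = date(9999, 12, 31)  # sentinel for "until further notice"

# Hypothetical bitemporal price history for one article. Each row carries a
# valid-time interval [valid_from, valid_to) and a transaction-time interval
# [recorded_at, superseded_at), both right-half-open.
rows = [
    # price 1.95 believed valid indefinitely, as recorded before May 25, 2006
    {"price": 1.95, "valid_from": date(2006, 3, 1), "valid_to": MAX,
     "recorded_at": date(2006, 3, 1), "superseded_at": date(2006, 5, 25)},
    # correction entered on May 25: 1.95 is valid only until May 31 ...
    {"price": 1.95, "valid_from": date(2006, 3, 1), "valid_to": date(2006, 6, 1),
     "recorded_at": date(2006, 5, 25), "superseded_at": MAX},
    # ... and 2.25 applies from June 1 on
    {"price": 2.25, "valid_from": date(2006, 6, 1), "valid_to": MAX,
     "recorded_at": date(2006, 5, 25), "superseded_at": MAX},
]

def price_as_known(valid_on, known_on):
    """Price valid on `valid_on`, as the database believed on `known_on`."""
    for r in rows:
        if (r["valid_from"] <= valid_on < r["valid_to"]
                and r["recorded_at"] <= known_on < r["superseded_at"]):
            return r["price"]
    return None

# On May 20 the June 1 increase was not yet recorded, so a customer asking
# about a June 15 purchase was still quoted the old price:
print(price_as_known(date(2006, 6, 15), date(2006, 5, 20)))  # 1.95
print(price_as_known(date(2006, 6, 15), date(2006, 5, 27)))  # 2.25
```

After May 25 the same valid-time question yields the new price, because the transaction-time axis selects the corrected rows.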

Details of temporal data storage are often defined (frequently in connection with the archiving of data) as requirements for the audit security of information systems, e.g. how long changes must remain verifiable.

Mapping of temporal data in database systems

The following variants exist for mapping temporal data:

  • Use of temporal databases
    These are database systems in which the system itself provides support for temporal data storage that goes beyond time-related data types . At present, however, only prototypes exist; no commercially available database system comprehensively covers the requirements of temporal data management.
  • Use of spatio-temporal databases
    These are database systems that are designed not only to support time-dependent data but also to store spatial information. The focus here is on spatial information. Such databases are used, for example, in the field of traffic telematics .
  • Mapping in conventional relational databases
    Since temporal data types are also supported in conventional relational database systems, temporal data can in principle be stored in such databases, with the temporal attributes being mapped as "normal" attributes. The handling of the temporal aspects must be done by the application programs or by a framework used for application development .
  • Mapping in other databases (e.g. object databases )
    For other database systems, in particular also the object-oriented databases, there is no uniformity and distribution comparable to the relational systems, so that a general statement about the mapping of temporal data is not possible. However, most of the aspects outlined below can certainly also be transferred to such databases.

In the following section, general properties of temporal data management are considered, which largely apply to all of the above mapping variants. This is followed by a detailed explanation of the mapping of temporal data in conventional relational databases. Further information on the other mapping variants can be found in the explanation of the respective database management systems.

Explanation of terms and terminology

The most important terms relating to temporal data storage are explained below. These explanations essentially follow the so-called "consensus glossary" (see web links ).

Validity time and transaction time

As initially mentioned, the validity time (valid time) denotes the point or period in time at which a state of affairs applies in the modeled representation of the real world (the reference world). Both future and past periods can be relevant.

In contrast, the transaction time denotes the time at which a fact was stored in the database, and thus the period during which this fact was considered a correct image of the reference world. Unlike the validity time, the transaction time can never refer to the future. An interesting example of temporal data can be found in the database of Wikipedia itself, where articles carry transaction time stamps so that the versions of an article at different points in time can be reconstructed (see the version history of this article).

If both the validity time and the transaction time are relevant, one speaks of bitemporal data storage. Note in this context that the transaction time can be assigned by the system (e.g. the database system), whereas the validity time must be specified by the user.

The consensus glossary also contains the term user-defined time . It is intended to separate other time attributes that are "normal" attributes in the database (such as a date of birth) from temporal data management proper. The designation is unfortunate because the validity time is also determined by the user.

A snapshot denotes the view of temporal data at a fixed point of the validity time and, where applicable, the transaction time. Such a snapshot usually refers to the current time. A database without temporal support offers only such a snapshot, which is why such databases are also referred to as snapshot databases.

Points in time, time intervals and durations

A point in time (instant) represents a point on the time axis used. Different granularities are possible (e.g. exact to the day or to the second).

Times can be anchored (fixed) or floating. An anchored time specification refers to a specific point in time or a specific interval on the time scale. A duration is a floating time interval.

Anchored time intervals are referred to as periods (time period). The term interval without qualification is misleading in this context, especially because SQL uses INTERVAL for a duration.

The smallest possible time interval of a given granularity is called a chronon ; for a date, for example, it is one day.

Time stamping

Time stamping is the addition of a time reference to a data attribute or a data row (also known as a tuple in connection with relational databases ). A distinction is made between validity, transaction, and bitemporal time stamps.

In general, it is initially left open how this time reference is technically represented. It can be a simple time value that expresses validity at a single point in time. More often, however, the time reference takes the form of a time interval. The most general form is the so-called temporal element, which represents a set of one or more time intervals.

In the consensus glossary, this time reference is called a timestamp . It should not be confused with the conventional meaning of that term: here it is a much more abstract form of time stamp.

When filling time stamps, a distinction is made between explicit and implicit time stamps. With explicit time stamping, the time stamp is not supplied by the database system but must be supplied explicitly by the application program (or by the architecture used, e.g. a database trigger ). Implicit time stamping is only available in "real" temporal databases , where the database system usually takes on this task. Note that this can be fully encapsulated only for the transaction time; the validity time must still be specifiable explicitly, though temporal databases often offer special interfaces for this and do not handle the time stamp as a normal attribute.
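Outside a true temporal database, implicit transaction-time stamping can be approximated with a database trigger, as mentioned above. A minimal sketch using SQLite (table, column, and trigger names are illustrative, echoing the German identifiers used later in this article):

```python
import sqlite3

# A trigger fills ErfasstAm (transaction-time start) on every insert, so the
# application cannot forget -- or forge -- the transaction time.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Artikel (
    ArtNr     INTEGER,
    Preis     REAL,
    ErfasstAm TEXT          -- transaction-time start, set by the trigger
);
CREATE TRIGGER stamp_artikel AFTER INSERT ON Artikel
BEGIN
    UPDATE Artikel SET ErfasstAm = datetime('now')
     WHERE rowid = NEW.rowid;
END;
""")

# The application supplies only the business data:
con.execute("INSERT INTO Artikel (ArtNr, Preis) VALUES (4711, 1.95)")
stamp, = con.execute("SELECT ErfasstAm FROM Artikel").fetchone()
print(stamp)  # e.g. '2006-05-25 10:30:00' -- supplied by the system, not the caller
```

The validity time, by contrast, cannot be derived by the system and would still have to be passed in by the application.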

Tuple versus attribute time stamping

When introducing time stamping, the question arises at which level it should be applied. In principle, every attribute that does not change synchronously with another one would have to be versioned on its own, i.e. given its own time stamp. This procedure is known as attribute time stamping.

However, the administrative effort for attribute time stamping is considerable, so often all attributes of a data row (a tuple ) are versioned together even though they do not change synchronously. This is known as tuple time stamping.

The decision as to which method to choose depends primarily on the frequency with which the individual attributes are changed. Typically, attributes with a high change frequency will be versioned individually, whereas attributes with low change frequency will be versioned together.

Temporal normalization (coalescing)

With tuple time stamping, however, evaluations that consider only one of the jointly versioned attributes inevitably yield consecutive periods in which the attribute value does not differ. One would like to combine such periods; this combination is called temporal normalization (also coalescing ).

Tuple time stamping is not the only reason why temporal normalization is needed. It is also necessary, for example, when bitemporal data is evaluated separately by validity time or transaction time only. In addition, such a combination is desirable when several time-stamped tuples are joined in evaluations.

Formally, temporal normalization means combining the time stamps of tuples with identical attribute values whenever the data types used for time stamping permit it. With time intervals, consecutive intervals with identical attribute values are combined; with temporal elements , all periods in which a particular attribute value occurs are combined into one element.

In the following example, the price and the target stock of an article are versioned together. Evaluated separately without normalization, each attribute would yield four individual intervals. Temporal normalization combines two adjacent intervals with the same value into one in each case.

Item data

  Item no. | Valid from | Valid until          | Price | Target stock
  4711     | 03/01/06   | 05/31/06             | €1.95 | 1000
  4711     | 06/01/06   | 07/31/06             | €2.25 | 800
  4711     | 08/01/06   | 09/30/06             | €2.25 | 750
  4711     | 10/01/06   | until further notice | €2.45 | 750

Price per item (normalized)

  Item no. | Valid from | Valid until          | Price
  4711     | 03/01/06   | 05/31/06             | €1.95
  4711     | 06/01/06   | 09/30/06             | €2.25
  4711     | 10/01/06   | until further notice | €2.45

Target stock per item (normalized)

  Item no. | Valid from | Valid until          | Target stock
  4711     | 03/01/06   | 05/31/06             | 1000
  4711     | 06/01/06   | 07/31/06             | 800
  4711     | 08/01/06   | until further notice | 750
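Coalescing over interval time stamps amounts to merging adjacent intervals with equal values. A sketch in Python (the function name is illustrative; the data is the price history above, represented here with right-half-open intervals, i.e. each end equals the start of the following row):

```python
from datetime import date

MAX = date(9999, 12, 31)  # stand-in for "until further notice"

def coalesce(rows):
    """Merge consecutive validity intervals that carry the same value.

    `rows` are (valid_from, valid_to, value) triples with half-open
    intervals [valid_from, valid_to).
    """
    merged = []
    for frm, to, val in sorted(rows):
        if merged and merged[-1][2] == val and merged[-1][1] == frm:
            merged[-1] = (merged[-1][0], to, val)   # extend previous interval
        else:
            merged.append((frm, to, val))
    return merged

# Price history of item 4711, projected from the jointly versioned table:
# the two €2.25 rows differ only in the target stock.
prices = [
    (date(2006, 3, 1), date(2006, 6, 1), 1.95),
    (date(2006, 6, 1), date(2006, 8, 1), 2.25),
    (date(2006, 8, 1), date(2006, 10, 1), 2.25),
    (date(2006, 10, 1), MAX, 2.45),
]
print(coalesce(prices))
# three intervals remain: the two adjacent €2.25 rows are combined
```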

Mapping in conventional relational database systems

The mapping of temporal data in conventional relational databases is possible, but there is no standardized implementation procedure, since it depends strongly on the respective requirements. The increase in complexity and the disadvantages compared to a "conventional" snapshot mapping are considerable, which is why various simplified mappings are also possible depending on the situation. The following disadvantages are associated with the support of temporal data:

  • The data volume increases considerably. A cleanup function may be necessary to archive or delete older data.
  • Access to the current value becomes much more complex ( implementation effort , performance ); this weighs heavily because in many cases it is by far the most common form of access.
  • Integrity constraints derived from the (non-temporal) ER model can no longer be mapped simply by defining primary keys and using referential integrity .

It is therefore inadvisable to add temporal support for data that does not strictly require it. This also means that when time dependency is introduced into a data model, not all relations should be converted.

In the following, various conceptual aspects of a full mapping of temporal data are presented, followed by a discussion of various simplified mappings.

Time stamping

As a rule, time intervals are used for time stamping. Since relational databases have no (anchored) time interval as an explicit attribute type, two time attributes (e.g. of type DATE or TIMESTAMP) must be included in the table definition to define the start and end of the interval, in the bitemporal case separately for validity and transaction time. There are two basic approaches:

  1. Interval representation : addition of two time values per data row (start, end)
  2. Point representation : addition of a single time value that defines the start of the row's validity; the end is implicitly determined by the start of the temporally following row

Both approaches have their advantages and disadvantages:

  1. Disadvantages of the interval representation :
    • Overlapping rows can be inserted without compromising the integrity of the database. A separate validation is required (e.g. using database triggers ).
    • A special value is required to indicate an indefinite validity. Either NULL or the minimum or maximum possible date can be used for this (see below ).
    • In the event of a change in value, in addition to inserting a new line, it may be necessary to adjust (terminate) the previously valid line.
  2. Disadvantages of the point representation :
    • The determination of the data at a specific point in time (usually the current point in time) is difficult and requires a subquery ( subselect ).
    • The end of validity or an interruption of validity can only be represented by setting all data attributes of the line to empty ( NULL).
    • In the bitemporal case, the point representation cannot be used for both time dimensions (see example).

In the case of the interval representation , one also has the choice between a closed and a right-half-open interval , i.e. one whose end is itself no longer part of the interval. Many arguments favor the latter variant: otherwise, when checking intervals for seamless continuation, a chronon would always have to be added to the end, which, for the TIMESTAMP data type for example, is not possible in a database-independent way.
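The argument for right-half-open intervals can be made concrete (illustrative Python; the function names are hypothetical and a day is taken as the chronon):

```python
from datetime import date, timedelta

# Half-open intervals [from, to): two rows join seamlessly exactly when one
# ends where the next begins -- no chronon arithmetic needed.
def seamless_half_open(end_a, start_b):
    return end_a == start_b

# Closed intervals [from, to]: the same test needs "end + one chronon", and
# the chronon depends on the granularity (a day here; for TIMESTAMP columns
# it would differ between database systems).
def seamless_closed(end_a, start_b, chronon=timedelta(days=1)):
    return end_a + chronon == start_b

print(seamless_half_open(date(2006, 6, 1), date(2006, 6, 1)))  # True
print(seamless_closed(date(2006, 5, 31), date(2006, 6, 1)))    # True
```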

The following example shows the SQL queries for both variants in the case of validity time stamping. The currently valid prices of articles (identified by ArtNr) are determined. The start of validity is expressed by the column GueltigAb; for the interval representation, the end of validity is defined by UngueltigAb (so the interval is right-half-open). In addition, for an unlimited end of validity, it is assumed that the maximum possible date is entered.

 -- Interval representation
 SELECT ArtNr, Preis
   FROM Artikel AS a
  WHERE a.GueltigAb <= CURRENT_DATE
    AND a.UngueltigAb > CURRENT_DATE

 -- Point representation
 SELECT ArtNr, Preis
   FROM Artikel AS a
  WHERE a.GueltigAb = (SELECT MAX(GueltigAb)
                         FROM Artikel
                        WHERE ArtNr = a.ArtNr
                          AND GueltigAb <= CURRENT_DATE
                      )
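Both query styles can be tried out, for instance, against SQLite (illustrative data; CURRENT_DATE is replaced by a fixed reference date so the result is reproducible):

```python
import sqlite3

# Price rows with half-open validity intervals; '9999-12-31' marks an
# unlimited end of validity. ISO date strings compare correctly as text.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Artikel (ArtNr INTEGER, GueltigAb TEXT,"
            " UngueltigAb TEXT, Preis REAL)")
con.executemany("INSERT INTO Artikel VALUES (?, ?, ?, ?)", [
    (4711, "2006-03-01", "2006-06-01", 1.95),
    (4711, "2006-06-01", "2006-10-01", 2.25),
    (4711, "2006-10-01", "9999-12-31", 2.45),
])
today = "2006-06-15"

# Interval representation: compare against both interval bounds.
interval = con.execute("""
    SELECT ArtNr, Preis FROM Artikel AS a
     WHERE a.GueltigAb <= ? AND a.UngueltigAb > ?""", (today, today)).fetchall()

# Point representation: only GueltigAb is needed; the row with the latest
# start not after the reference date wins (subquery per article).
point = con.execute("""
    SELECT ArtNr, Preis FROM Artikel AS a
     WHERE a.GueltigAb = (SELECT MAX(GueltigAb) FROM Artikel
                           WHERE ArtNr = a.ArtNr AND GueltigAb <= ?)""",
    (today,)).fetchall()

print(interval, point)  # both yield [(4711, 2.25)]
```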

Such a query is even more complex in the bitemporal case. In this example, too, right-half-open intervals are used; the transaction time interval is defined by ErfasstAm and GeloeschtAm. With the point representation , note that it is not sufficient to record only the start of the interval for the transaction time; at least the transaction time must be represented as an interval so that a meaningful interpretation is possible. In contrast to the example above, the values are determined not for the current date (expressed by CURRENT_DATE) but for specified times, expressed by the parameters :VorgGueltigkeit and :VorgDatenstand.

 -- Interval representation (for both time dimensions)
 SELECT ArtNr, Preis
   FROM Artikel AS a
  WHERE a.GueltigAb <= :VorgGueltigkeit
    AND a.UngueltigAb > :VorgGueltigkeit
    AND a.ErfasstAm <= :VorgDatenstand
    AND a.GeloeschtAm > :VorgDatenstand

 -- Point representation (for the validity time only)
 SELECT ArtNr, Preis
   FROM Artikel AS a
  WHERE a.GueltigAb = (SELECT MAX(GueltigAb)
                         FROM Artikel
                        WHERE ArtNr = a.ArtNr
                          AND GueltigAb <= :VorgGueltigkeit
                          AND ErfasstAm <= :VorgDatenstand
                          AND GeloeschtAm > :VorgDatenstand
                      )
    AND a.ErfasstAm <= :VorgDatenstand
    AND a.GeloeschtAm > :VorgDatenstand

In the above example it is assumed that the maximum possible date is entered as the end of the transaction time for the currently valid row. Often, however, NULL is used for this case instead, since querying the currently valid (undeleted) data is then easier ( GeloeschtAm IS NULL). In that case, however, queries for past data states become more complicated: in the example above, the column GeloeschtAm would have to be replaced by the following construct:

 CASE WHEN GeloeschtAm IS NULL THEN '9999-12-31' ELSE GeloeschtAm END

'9999-12-31' represents the maximum possible date here; the actual maximum value depends on the database system used.
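Where supported, the standard function COALESCE(GeloeschtAm, '9999-12-31') expresses the same mapping more briefly than the CASE construct. A minimal sketch against SQLite (data and names are illustrative):

```python
import sqlite3

# NULL as "still current": the current state is easy to query, but querying
# a past data state needs NULL mapped to the maximum date first.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Artikel (ArtNr INTEGER, Preis REAL,"
            " ErfasstAm TEXT, GeloeschtAm TEXT)")
con.executemany("INSERT INTO Artikel VALUES (?, ?, ?, ?)", [
    (4711, 1.95, "2006-03-01", "2006-05-25"),  # superseded on May 25
    (4711, 2.25, "2006-05-25", None),          # current row: GeloeschtAm IS NULL
])

# Current state: simple predicate.
cur = con.execute(
    "SELECT Preis FROM Artikel WHERE GeloeschtAm IS NULL").fetchall()

# State as known on May 20: COALESCE substitutes the maximum date for NULL.
past = con.execute("""
    SELECT Preis FROM Artikel
     WHERE ErfasstAm <= '2006-05-20'
       AND COALESCE(GeloeschtAm, '9999-12-31') > '2006-05-20'""").fetchall()

print(cur, past)  # [(2.25,)] [(1.95,)]
```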

Determine the primary key

In temporal relations, which express evolutions of object states, it is not possible to identify tuples (data lines) without taking the time dimension into account.

When using the point representation for time stamping, the "normal" primary key only needs to be extended by the attribute defining the start of validity. For the example of the article, the start of validity would have to be included in the primary key in addition to the article number.

There are several alternatives with the interval representation : either the start or the end of the interval can be included in the key. The value chosen to represent an unlimited end of validity also plays a role in this decision:

  • The primary key of a relation is defined on the one hand to ensure uniqueness, on the other hand the definition of the primary key is also a question of optimization, because when accessing the data this key is used as an index to search for the appropriate rows.
  • The currently valid line is often required, especially for transaction time stamps. This is the line that has an unlimited expiration date. This would speak in favor of including the end of validity in the primary key.
  • In conventional relational database systems, NULL cannot be used in a primary key column, which is why the end of validity cannot be included in the key if NULL is chosen to represent an unlimited end of validity. As an alternative, the maximum possible date value can be used to represent an unlimited end of validity.

It should also be noted that when using the interval representation, none of the variants ensures that the lines do not overlap; this must be checked separately.

For performance reasons, an additional attribute is occasionally introduced as a surrogate key (artificial key) instead of the composite key in temporal relations in order to obtain the shortest possible identification of a data line. In many database systems there is the option of automatically assigning unique identifications for such surrogate keys.

Integrity Checks

As already mentioned, integrity constraints that are normally covered by the uniqueness of the primary key or by referential integrity must be secured in another way for temporal data, for example by checks in the application programs or by database constraints and triggers.

Particularly when using constraints and database triggers, the problem can arise that integrity is temporarily violated during an update consisting of several database operations and is restored only after all operations have been carried out. The database system must then offer the option of deferring the integrity checks to the end of a transaction .

When using the interval representation , it is necessary, as already mentioned, to check that the rows for an object do not overlap. The following exemplary SQL query returns overlapping entries for articles (identified by ArtNr) from a table named Artikel, where the validity time interval is expressed by GueltigAb and UngueltigAb.

 SELECT * FROM Artikel AS x, Artikel AS y
  WHERE x.ArtNr = y.ArtNr
    AND x.UngueltigAb > y.GueltigAb
    AND y.UngueltigAb > x.GueltigAb
    AND <condition(s) to exclude the same row in x and y>

The latter condition is necessary so that a row is not diagnosed as overlapping with itself. How exactly it is formulated depends on the primary key chosen; some database systems offer special functions or data types for identifying a table row.

If the primary key ensures that there can only be one entry at a start of validity, a simplified check is also possible in the following way:

 SELECT * FROM Artikel AS x, Artikel AS y
  WHERE x.ArtNr = y.ArtNr
    AND x.UngueltigAb > y.GueltigAb
    AND y.GueltigAb > x.GueltigAb

This approach also has the advantage that the two lines of an overlapping pair are only returned as one result line.
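The simplified check can be demonstrated, for example, with SQLite (illustrative rows; the first two rows for item 4711 deliberately overlap in June):

```python
import sqlite3

# Self-join overlap check: strict '>' on the start dates reports each
# overlapping pair exactly once and never matches a row against itself.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Artikel (ArtNr INTEGER, GueltigAb TEXT,"
            " UngueltigAb TEXT)")
con.executemany("INSERT INTO Artikel VALUES (?, ?, ?)", [
    (4711, "2006-03-01", "2006-07-01"),   # overlaps the next row in June
    (4711, "2006-06-01", "2006-10-01"),
    (4712, "2006-03-01", "2006-06-01"),   # seamless half-open intervals:
    (4712, "2006-06-01", "9999-12-31"),   # no overlap for this item
])
pairs = con.execute("""
    SELECT x.ArtNr FROM Artikel AS x, Artikel AS y
     WHERE x.ArtNr = y.ArtNr
       AND x.UngueltigAb > y.GueltigAb
       AND y.GueltigAb > x.GueltigAb""").fetchall()
print(pairs)  # [(4711,)] -- exactly one offending pair
```

Note that the adjacent half-open intervals of item 4712 are correctly not reported, since the end of one equals the start of the next.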

Example: Catalog entry ( Dependent ) references article ( Parent )

Checking referential integrity in the temporal sense is even more complex if both the referencing table (the dependent table) and the referenced table (the parent table) carry time stamps. For each row of the dependent table it must be checked whether

  1. there is an associated parent row that begins at or before the start of the dependent row's period and overlaps that period,
  2. there is an associated parent row that ends at or after the end of the dependent row's period and overlaps that period, and
  3. every parent row whose end of validity lies within the dependent row's period (and does not coincide with its end) has a direct successor, i.e. the parent rows cover the period without gaps.

Only if all three conditions are met is there no violation of integrity.
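Taken together, the three conditions amount to a gap-free covering of the dependent interval by parent rows. A sketch in Python (the function name is hypothetical; intervals are right-half-open):

```python
from datetime import date

def covers(parent_rows, dep_from, dep_to):
    """Check temporal referential integrity for one dependent interval.

    `parent_rows` are half-open (valid_from, valid_to) intervals of the
    referenced object; the dependent interval [dep_from, dep_to) must be
    covered without gaps -- the three conditions from the text combined.
    """
    relevant = sorted((f, t) for f, t in parent_rows
                      if t > dep_from and f < dep_to)
    if not relevant or relevant[0][0] > dep_from:      # condition 1 violated
        return False
    reach = relevant[0][1]
    for f, t in relevant[1:]:
        if f > reach:                                  # condition 3: a gap
            return False
        reach = max(reach, t)
    return reach >= dep_to                             # condition 2

MAX = date(9999, 12, 31)
article_versions = [(date(2006, 3, 1), date(2006, 6, 1)),
                    (date(2006, 6, 1), MAX)]           # seamless price history

print(covers(article_versions, date(2006, 3, 1), date(2006, 9, 1)))  # True
print(covers(article_versions, date(2006, 1, 1), date(2006, 9, 1)))  # False
```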

In the example shown, two catalog entries ( dependent ) reference the article ( parent ). One catalog entry covers the period March 1, 2006 to August 31, 2006; a further catalog entry is valid from September 1, 2006 and represents the current catalog. For all periods required by the catalog entries, it must be ensured that the referenced article exists, even though price changes of the article occur during these periods (the price is evidently not shown in the catalog). Both catalog entries therefore each reference two different rows of the article. Note that in this example the article table defines both the price and the existence of the article, i.e. whether the article is part of the assortment at the time in question.

Simplifying mappings

As an alternative to transaction time stamping, if older data needs to be accessed only very rarely, the reconstructability of older data constellations can be ensured by logging the database updates. Implementing such logging, however, requires the use of central functions for database updates. In addition, functions must be provided that reconstruct old database states by evaluating the log.

With validity time stamping , it makes sense in the case of cyclical data changes to use the identification of the validity period directly instead of the normally used intervals for time stamping (e.g. with an annual change cycle , only the year can be used as an additional primary key component). This procedure also has the advantage that the partitioning concepts made available by some database systems can be used to increase efficiency.

If temporality is only required to document the data states with which a certain evaluation function was carried out, the execution event itself can be used as the time stamp. Note, however, that this approach becomes questionable as soon as two different evaluation functions of this kind operate on the same data.

Another approach to simplifying access to the current data is to store the history in separate tables. This is particularly interesting with regard to the performance when accessing the current data. This procedure is possible for both the transaction time and the validity time, but for the latter only if no future valid data is to be recorded. In addition, the price is relatively high as all relations have to be duplicated.

Archiving and cleanup

Temporal data storage inevitably leads to a constant increase in the data volume , since outdated data is deliberately not removed from the database. When temporal data storage is introduced, it is therefore necessary in the longer term to plan a method of database archiving that allows the operational database to be cleaned up.

Possible options

In contrast to non-temporal data, archiving is not necessary with temporal data storage in order to be able to reconstruct deleted or changed states of an object, since this is possible anyway through the use of a transaction time stamp.

For this reason, depending on the situation, the simplest variant may suffice: a cleanup ( vacuuming ) of the temporal database, i.e. the deletion of older data that is no longer required. Here, too, the cleanup process must maintain the consistency of the database, which is also a central aspect of "proper" application-oriented database archiving.

There are in turn different methods of application-oriented database archiving. A main distinguishing feature is whether the archive is independent or integrating. An independent archive is itself an internally consistent database that can be accessed directly if necessary. An integrating archive, on the other hand, serves only to restore the data into the operational database when needed ( copy-back or move-back ).

Removal criteria

Criteria that are as clear and comprehensible as possible must be defined to determine when a data element is removed from the operational database. In the case of bitemporal databases, the transaction time is the first choice: for example, all records whose transaction time ended before a specified key date are removed from the database and, if necessary, archived.

The main advantage of using the transaction time is that it is assigned by the system and cannot be influenced by the user; it is therefore impossible for newly recorded data to refer to a period that has already been transferred to the archive.

The use of the transaction time is not possible if only the validity time is managed. In addition, in the bitemporal case it is certainly often necessary to use the validity time as a criterion in addition to the transaction time.

Ensure consistency

The cleanup and archiving process must not endanger the consistency of the operational database. In the case of an independent archive, this also applies to the database used for archiving.

Example: The database was cleaned up on July 1, 2006

With regard to temporal data, this means the following:

  • Intervals for the validity or transaction time of an object must not overlap.
  • For relationships between objects, referential integrity must also be maintained in the temporal sense (see also integrity checks ).
  • The data should be temporally normalized, i.e. coalescing may have to be carried out.

When removing data from the operational database, it may be necessary to split intervals in order to preserve referential integrity.

The example on the right illustrates this: all data valid before July 1, 2006 is to be moved to the archive database. The article version with the price of €1.95 can therefore be removed from the operational database completely. To continue to guarantee referential integrity, however, the article version with the price of €2.25 must be split, as must the catalog entry "Spring/Summer 2006". If, in the case of an independent archive, integrity is also to be preserved in the archive, exactly the split-off counterparts of those intervals must be inserted there. This also makes clear that coalescing is then necessary in the archive database after each subsequent archiving run, since the parts of the intervals cut off during one archiving run are archived in the next.
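The splitting of intervals at the archiving key date can be sketched as follows (illustrative Python with half-open intervals; the function name is hypothetical):

```python
from datetime import date

def split_for_archive(rows, cutoff):
    """Partition half-open (frm, to, value) rows at an archiving key date.

    Rows entirely before `cutoff` move to the archive; rows spanning the
    cutoff are split so that both databases stay internally consistent.
    """
    archive, operational = [], []
    for frm, to, val in rows:
        if to <= cutoff:
            archive.append((frm, to, val))
        elif frm >= cutoff:
            operational.append((frm, to, val))
        else:                                   # spans the cutoff: split it
            archive.append((frm, cutoff, val))
            operational.append((cutoff, to, val))
    return archive, operational

MAX = date(9999, 12, 31)
prices = [(date(2006, 3, 1), date(2006, 6, 1), 1.95),
          (date(2006, 6, 1), date(2006, 10, 1), 2.25),
          (date(2006, 10, 1), MAX, 2.45)]
arch, oper = split_for_archive(prices, date(2006, 7, 1))
print(arch)   # the €1.95 row whole, plus the €2.25 row cut at July 1
print(oper)   # the €2.25 remainder from July 1, plus the €2.45 row
```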

The most problematic aspect in this context is restoring data, especially if the restore is only partial rather than covering an entire period, and the restored data is also left in the archive ( copy-back ). Measures must then be taken to ensure consistency, since, for example, overlaps between the time intervals of an object can occur.

Typical areas of application

The following is a list of typical applications for temporal data storage. However, this list does not claim to be complete.

Data warehouse

A data warehouse is a database that was created primarily to analyze the data and is fed from one or more other systems (mostly operational databases). Typically, a data import is carried out periodically to transfer the data from the operative systems to the data warehouse.

In such a constellation, it is advisable to add the time dependency at the time of the periodic import, using the import time as the start of validity. This has the advantage that the operational system is not burdened with the complexity required for temporal data storage, while time-dependent evaluations remain possible via the data warehouse.

Since the import is usually carried out on a fixed cycle (e.g. monthly), no intervals are needed for time stamping; the validity period can be identified by a single attribute (see also simplifying mappings). Furthermore, a dimension table for time in the sense of the star schema can be set up, enabling evaluations in the context of online analytical processing (OLAP).
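With a fixed monthly cycle, a single period attribute can thus serve as the time stamp. A minimal SQLite sketch (table and column names are illustrative):

```python
import sqlite3

# A period key 'YYYY-MM' identifies each imported snapshot; no interval
# logic is needed, and the key doubles as a primary key component.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE FactStock (
    Periode TEXT,
    ArtNr   INTEGER,
    Bestand INTEGER,
    PRIMARY KEY (Periode, ArtNr)
)""")
con.executemany("INSERT INTO FactStock VALUES (?, ?, ?)", [
    ("2006-05", 4711, 980),
    ("2006-06", 4711, 760),
])

# Evaluating one period needs only an equality predicate:
row = con.execute("SELECT Bestand FROM FactStock"
                  " WHERE Periode = '2006-06' AND ArtNr = 4711").fetchone()
print(row)  # (760,)
```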

Payroll

A typical case requiring bitemporal data is payroll accounting. Among other things, the period during which an employee belongs to a salary group must be recorded correctly (validity time). In the event of a subsequent correction (transaction time) of an assigned salary group, or merely of the assignment period, the basis on which a payroll run was computed must remain traceable.

Risk management in the banking sector

In particular, due to the regulations defined by Basel II, credit institutions and financial service providers must be able to document clearly which information (e.g. equity capital and ratings ) formed the basis of which decision.

This requires extensive transaction time stamping, and in some cases bitemporal mapping. The latter is required, for example, for the ratings of a borrower : on the one hand, it must be recorded at what point in time a rating agency made such an assessment (valuation date). On the other hand, it must also be documented when this new assessment became known to the credit institution and was entered into the database.

Literature

  • C. J. Date, Hugh Darwen, Nikos A. Lorentzos: Time and Relational Theory. Temporal Databases in the Relational Model and SQL. 2nd Edition. Morgan Kaufmann, Waltham, Massachusetts 2014, ISBN 978-0-12-800675-7.
  • Tom Johnston: Bitemporal Data. Theory and Practice. Morgan Kaufmann, Waltham, Massachusetts 2014, ISBN 978-0-12-408055-3.
  • Thomas Myrach: Temporal Databases in Business Information Systems. Teubner, Wiesbaden 2005, ISBN 3-519-00442-9.
  • Richard T. Snodgrass, Christian S. Jensen: Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, San Francisco 2000, ISBN 1-55860-436-7 (arizona.edu [PDF; 5.0 MB]).

Web links

References

  1. Myrach 2005, p. 23
  2. Revision Table. MediaWiki Database Layout
  3. Consensus glossary, definitions of user-defined time, chronon, temporal element, timestamp, coalesce
  4. Myrach 2005, pp. 389-392
  5. Myrach 2005, p. 63
  6. James Clifford, Abdullah Uz Tansel: On An Algebra For Historical Relational Databases: Two Views. In: Shamkant B. Navathe (ed.): Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data. ACM Press, 1985, ISBN 0-89791-160-1, pp. 247-265, doi:10.1145/318898.318922.
  7. Myrach 2005, p. 134
  8. Bela Stantic, John Thornton, Abdul Sattar: A Novel Approach to Model NOW in Temporal Databases. (PDF; 167 kB) 2003, accessed June 25, 2010.
  9. Myrach 2005, pp. 158, 164, 173ff.
  10. Snodgrass, Jensen 1999 (PDF; 5.0 MB), p. 127ff.
  11. Theory and Practice of Bitemporal Databases and Their Archiving, Pfister, 2005 (Memento of September 28, 2007 in the Internet Archive), p. 68ff.
This version was added to the list of articles worth reading on May 11, 2007 .