Denormalization

Denormalization is the deliberate reversal of normalization with the aim of improving the runtime behavior of a database application. From the point of view of the ANSI-SPARC architecture, the conceptual level of a data model is designed to be fully normalized. Independently of this, the internal level can be deliberately denormalized. Denormalization therefore takes place exclusively on the internal level and does not remove the requirement to normalize the conceptual level first.

A logically ideal ("normalized") data model is completely free of redundancy, apart from the technically necessary repeated storage of foreign keys in primary key–foreign key relationships.

Denormalization can often achieve significantly greater performance improvements than tuning the database installation.

In addition to improving runtime behavior, denormalization is also used to reduce the complexity of a system or to simplify the administration of the stored data.

Areas of denormalization

Relaxing the first normal form

Violations of the first normal form are usually accepted in order to avoid unnecessarily complicating data management and the associated system components.

The first normal form requires atomic storage of data, meaning that only atomic (indivisible) information may be stored in an attribute (a database field). Example: defining a 100-character data field to hold one or more telephone numbers violates the first normal form. To satisfy the first normal form, a separate table would have to be created for storing the telephone numbers, in which any number of telephone numbers could be saved for one person. In practice, however, keeping one or more telephone numbers in a single data field is often completely sufficient and reduces the complexity of the system.
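
As an illustration, a normalized design would hold the telephone numbers in a table of their own, while the denormalized design keeps them in a single text column. The following generic SQL sketch uses purely hypothetical table and column names:

    -- Normalized: one telephone number per row in a separate table (1NF)
    CREATE TABLE person (
        person_id  INTEGER PRIMARY KEY,
        name       VARCHAR(100)
    );
    CREATE TABLE phone_number (
        person_id  INTEGER REFERENCES person(person_id),
        phone      VARCHAR(30)
    );

    -- Denormalized: all telephone numbers of a person in one 100-character field
    CREATE TABLE person_denorm (
        person_id  INTEGER PRIMARY KEY,
        name       VARCHAR(100),
        phones     VARCHAR(100)   -- e.g. '+49 30 1234567; +49 171 7654321'
    );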

Another common violation of the first normal form is the storage of title, first name and surname in a single data field. As long as the system never needs to access the individual components of the name, storing them together in a single text field is also a reasonable way of simplifying the system.

In most database systems, the street and house number (possibly with additional letters) are stored in a single data field, although strictly speaking this violates the first normal form.

Relaxing the second or third normal form

The second and third normal forms require that all dependent attributes depend only on candidate keys. All relations that do not meet these requirements have to be split, which creates many new, smaller tables. To access this data, the individual tables must be recombined using SQL statements with joins. Executing a join is usually more expensive for the DBMS than accessing a single table.

The second or third normal form is usually relaxed with the aim of avoiding a join. This typically involves two tables that are in a one-to-many relationship, for example employee and department. If many performance-critical read accesses need an employee's data together with the department name, then storing the department name redundantly in each record of the employee table can be useful. These accesses can then be served entirely from the employee table; an additional access to the department table is no longer required.
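
The following SQL sketch illustrates the idea with hypothetical table and column names: in the normalized design every read that also needs the department name requires a join, whereas in the denormalized design the department name is stored redundantly in the employee table and the join disappears:

    -- Normalized: department name only in the department table
    SELECT e.emp_id, e.emp_name, d.dept_name
    FROM   employee e
    JOIN   department d ON d.dept_id = e.dept_id;

    -- Denormalized: dept_name is stored redundantly in employee,
    -- so the same read no longer needs a join
    SELECT emp_id, emp_name, dept_name
    FROM   employee;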

This type of denormalization is taken furthest in the dimensional modeling of a data mart for a data warehouse. If the dimension tables are designed to be completely normalized, a large number of individual tables result, linked to one another by foreign key relationships. The data model looks like a snowflake, hence the name snowflake schema. Access to the data often requires joins across the many individual tables created by normalization. The star schema, by contrast, designs the dimension tables in a denormalized way: the fact table depends directly on the individual dimension tables only, and there are no dependencies that have to be followed across multiple foreign key relationships. The number of dimension tables is smaller and fewer joins are needed to access the data, but the data in the dimension tables contains redundancy. Because data access usually performs better with the star schema, this schema is usually chosen in practice.
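
As a rough sketch with hypothetical names, a snowflake schema keeps a product dimension normalized across several linked tables, whereas a star schema collapses it into one denormalized dimension table:

    -- Snowflake: normalized dimension, extra joins needed for the group name
    CREATE TABLE product_group (
        group_id    INTEGER PRIMARY KEY,
        group_name  VARCHAR(50)
    );
    CREATE TABLE product (
        product_id    INTEGER PRIMARY KEY,
        product_name  VARCHAR(50),
        group_id      INTEGER REFERENCES product_group(group_id)
    );

    -- Star: one denormalized dimension table, group_name stored redundantly
    CREATE TABLE dim_product (
        product_id    INTEGER PRIMARY KEY,
        product_name  VARCHAR(50),
        group_name    VARCHAR(50)
    );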

Pre-computed aggregation

Large aggregations often need to be computed to execute queries, particularly in OLAP systems. If the response time of these queries is no longer acceptable, the aggregations can be computed and stored in advance. This is ideal for systems that are only updated at night: after the actual update of the data, all required aggregations are computed and stored. If a user then requests a key figure (KPI) during the day, all the necessary aggregations are already available and the key figure can be returned within seconds.
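
One possible realization is sketched below using PostgreSQL-style materialized views and hypothetical table names; the aggregates are computed once after the nightly load, and daytime queries only read the pre-computed rows:

    -- Pre-compute revenue per day and product during the nightly load
    CREATE MATERIALIZED VIEW sales_per_day AS
    SELECT sale_date, product_id, SUM(amount) AS total_amount
    FROM   sales
    GROUP  BY sale_date, product_id;

    -- Refresh after the nightly update of the base table
    REFRESH MATERIALIZED VIEW sales_per_day;

    -- Daytime query reads the small, pre-aggregated view only
    SELECT total_amount
    FROM   sales_per_day
    WHERE  sale_date = DATE '2024-01-15' AND product_id = 42;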

Fragmentation

A distinction is made between horizontal and vertical fragmentation.

With horizontal fragmentation (sharding), the totality of all records of a relation is divided among several tables. If these tables reside on the same server, this is usually just partitioning. The individual tables can, however, also be located on different servers: for example, the data for business in the USA can be stored on a server in the USA and the data for business in Europe on a server in Germany. This division is also referred to as regionalization.

Horizontal fragmentation does not create redundancy in the stored data, but in the structures. If a relation has to be changed, then not just one table but all the tables across which the data of that relation is distributed have to be changed. This carries the risk of anomalies in the data structures.
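
A minimal sketch of regionalization with hypothetical names: the records of one logical relation are split into two tables of identical structure, which may reside on different servers, and can be recombined through a view. Note that any structural change must be applied to both tables:

    -- Same structure on both servers; only the rows differ by region
    CREATE TABLE orders_usa (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER,
        amount       DECIMAL(10,2)
    );
    CREATE TABLE orders_europe (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER,
        amount       DECIMAL(10,2)
    );

    -- Optional global view recombining the fragments
    CREATE VIEW orders_all AS
    SELECT * FROM orders_usa
    UNION ALL
    SELECT * FROM orders_europe;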

With vertical fragmentation, the dependent attributes (non-key attributes) of a table are divided into two or more groups. Each group becomes a table of its own, supplemented by all key attributes of the original table. This can be useful if the attributes of a relation lead to records with a very large record length. If most accesses touch only a few attributes, these frequently accessed attributes can be combined into one group and the rest into a second group. Frequent accesses become faster because a smaller amount of data has to be read from disk. Infrequent accesses to the remaining attributes are not faster, but neither are they slower.
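
Sketched in SQL with hypothetical names (the BLOB column type varies by DBMS): the frequently accessed attributes remain in a narrow main table, while the rarely accessed or very long attributes move into a second table that repeats the primary key:

    -- Frequently accessed attributes: short records, fast scans
    CREATE TABLE customer_core (
        customer_id  INTEGER PRIMARY KEY,
        name         VARCHAR(100),
        city         VARCHAR(50)
    );

    -- Rarely accessed or large attributes, keyed by the same primary key
    CREATE TABLE customer_details (
        customer_id  INTEGER PRIMARY KEY REFERENCES customer_core(customer_id),
        notes        TEXT,
        photo        BYTEA   -- BLOB column; PostgreSQL-style type name
    );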

The record length from which it makes sense to split a table into several smaller tables also depends on the database system. Many database systems store data in blocks of 4 KiB, 8 KiB or 16 KiB. If the average record length is slightly more than 50% of a data block, a lot of storage space remains unused. If the average record length is larger than the block size used, data access becomes more complex. If BLOBs appear in a relation together with other attributes, vertical fragmentation is almost always advantageous.

Partitioning

Partitioning is a special case of horizontal fragmentation.

Large data stocks can be administered more easily if the data of a relation is divided into several small parts (partitions) that are stored separately. While one partition of a table is being updated, other partitions of the table can be reorganized at the same time. If an error is discovered in one partition, this single partition can be restored from a backup while programs continue to access the other partitions. Most established database vendors offer partitioning, see for example partitioning in DB2 and partitioning in MySQL.
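
A minimal sketch using MySQL-style range partitioning with hypothetical names; each partition can be backed up, restored or reorganized individually:

    -- One logical table, physically split into partitions by year
    CREATE TABLE sales (
        sale_id    INT,
        sale_date  DATE,
        amount     DECIMAL(10,2)
    )
    PARTITION BY RANGE (YEAR(sale_date)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );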

Most database systems offer the option of either addressing individual partitions or addressing all partitions under a uniform table name.

Data access can be accelerated by partitioning. The main advantage, however, is the easier administration of the entire table.

Index

Creating an index also stores data redundantly and is thus, strictly speaking, a denormalization. Most database management systems are able to update an index automatically whenever the data in the base table changes. Indexes can increase the performance of read accesses as well as of write accesses, since they support a targeted search for specific records. Often, however, write accesses become slower the more indexes exist, since both the data in the table and the data in the indexes have to be updated.
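
For illustration, a small sketch with hypothetical names; the index is a redundant search structure over the indexed column that the DBMS maintains automatically:

    -- Redundant search structure over the surname column
    CREATE INDEX idx_employee_surname ON employee (surname);

    -- Read access that can use the index instead of a full table scan
    SELECT emp_id, surname, first_name
    FROM   employee
    WHERE  surname = 'Meier';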

Disadvantages

The additional effort required to keep the redundant data consistent is often a disadvantage. Redundant storage carries the risk of data anomalies.

This risk can be avoided if the updating of the redundantly stored data can be delegated to the database management system.

Database vendors offer various features for automatically synchronizing redundantly stored data:

  • That an index is updated automatically is taken so much for granted that one would not expect anything else.
  • The pre-computation of aggregations is supported by materialized views, which update these aggregations automatically as required.
  • There are also triggers with which redundantly stored data can be updated automatically (see the sketch after this list).
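
As an illustration of the trigger approach, here is a sketch in MySQL-style syntax with hypothetical names, keeping the redundantly stored department name from the earlier employee/department example in sync:

    -- Keep the redundant dept_name column in employee up to date
    -- whenever the department name changes
    CREATE TRIGGER sync_dept_name
    AFTER UPDATE ON department
    FOR EACH ROW
        UPDATE employee
        SET    dept_name = NEW.dept_name
        WHERE  dept_id   = NEW.dept_id;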

If the database management system takes over such tasks, updating an individual record may be slowed down only imperceptibly. Bulk data processing, however, can become significantly slower when such features are used.

In most cases, denormalization leads to additional storage requirements. Often, however, the cost of the additional storage space is accepted in order to improve performance. In each individual case it must be weighed whether the advantages justify the associated disadvantages.