Normalization (database)

from Wikipedia, the free encyclopedia

Under normalization of a relational data schema (table structure) is the division of attributes (columns) in a plurality of relations (tables) in accordance with the normalization rules (s. U.), So that a form that no redundancies contains more.

A conceptual schema that contains data redundancies can lead to the fact that, when the database implemented with it is changed, the multiple data contained are not changed consistently , but only partially and incompletely, which can make them obsolete or contradictory. One speaks of occurring anomalies . In addition, multiple storage of the same data takes up storage space unnecessarily. To avoid redundancy, such tables are normalized.

There are different degrees to which a database schema can be immune to anomalies. Depending on which one speaks of it being in the first, second, third etc. normal form. These normal forms are defined by certain formal requirements for the schema.

A relational data schema is brought into a normal form by progressively breaking down its relations into simpler ones based on the functional dependencies that apply to them , until no further breakdown is possible. However, under no circumstances should data be lost. With Delobel's theorem , one can formally prove for a decomposition step that it does not entail any data loss.

Normalization is mainly done in the relational database design phase . There are algorithms for normalization ( synthesis algorithm ( 3NF), decomposition algorithm ( BCNF) , etc.) that can be automated.

The decomposition methodology follows relational design theory .


Objective: Increase in consistency by avoiding redundancy

Splitting of the table TBL_AdressenAlles

During normalization in these areas, columns (synonymous terms: fields , attributes ) of tables within these areas (the data schema) are first divided into new columns, e.g. B. Addresses in postcode , city and street . Tables are then split up, for example a table
tbl_AdressenAlle all with the fields company , street , zip code and city in these tables:

  • tbl_Adressen with the fields AdressID, Company , Street and ZIP
  • tbl_PLZOrt with the fields Postal code and place

See the illustration of the splitting of the table tbl_AdressenAlles - whereby the table tbl_Adressen still contains the unique primary key AdressID .

Note: In this example it is assumed that there is only one place name for each zip code, which is very often not the case in Germany - for example in rural areas, where up to 100 places “share” a zip code.

The purpose of normalization is to reduce redundancies (multiple recording of the same issue) and to prevent anomalies caused by them (e.g. as a result of changes in not all places) in order to simplify the updating of a database (changes only in one place) as well as ensuring the consistency of the data .


An example of this: A database contains customers and their addresses as well as orders that are assigned to the customers. Since there can be several orders from the same customer, entering the customer data (possibly with address data) in the order table would result in them appearing there multiple times, although the customer only ever has one set of valid data ( redundancy ). For example, it can happen that incorrect address data for the customer is entered in one order, and the correct data is recorded in the next order. This can lead to contradicting data - in this table or in relation to other tables. The data would then not be consistent, you would not know which data is correct. It is even possible that both addresses are incorrect because the customer has moved (see below for solution) .

With a normalized database, there is only one entry for the customer data in the customer table, with which each order for this customer is linked (usually via the customer number). In the case of a customer moving (another example is the change in VAT), there would be several entries in the corresponding table, but they can also be distinguished by specifying a validity period and, in the customer example above, can be clearly addressed using the combination of order date / customer number .

Another advantage of freedom from redundancy, which still plays an important role with millions of data records in a database, is the lower memory requirement if the data record in one table, for example tbl_auftrag, is transferred to a data record in another table, e.g. B. tbl_Kunde instead of containing this data itself.

Splitting tables for normalization

These are the recommendations that are made on the basis of the theory of normalization in database development in order to ensure, above all, the consistency of the data and a clear selection of data. The freedom from redundancy sought for this, however, is in competition with the processing speed or with other goals in special applications. It can therefore make sense to forego normalization or to undo it by denormalizing in order to

  • to increase the processing speed (performance) or
  • To simplify inquiries and thus to reduce the susceptibility to errors or
  • Map special features of processes (e.g. business processes ).

In these cases, automatic adjustment routines should be implemented regularly in order to avoid inconsistencies. Alternatively, the relevant data can also be blocked for changes.

Normal forms

Currently used normal forms are:

  • 1.Normal form (1NF)
  • 2. Normal form (2NF)
  • 3.Normal form (3NF)
  • Boyce - Codd normal form (BCNF)
  • 4.Normal form (4NF)
  • 5.Normal form (5NF)

On the one hand, they are used to assess the quality of the database schema under consideration , and on the other hand, they help to avoid errors when creating new schemas.

In addition, with the help of normalization, data structures can be obtained from non-relational sources that are formally correct in the sense of the normalization concept and can incorporate the data from their respective non-relational sources from which they arose (e.g. form data or spreadsheets).

The criteria for the respective normal forms are explained below. It should be noted that every normal form includes the criteria of the previous normal forms, i.e. H. for the following criteria amounts apply: .

First normal form (1NF)


Each attribute of the relation must have an atomic value range and the relation must be free of repeating groups. (Note: instead of "atomic", the term "atomic" is also used.)

Atomic means that composite, set-valued or nested value ranges ( i.e., relation-valued attribute value ranges) are not allowed. In a relation that is in 1NF, there is no attribute whose value range can be split into further (meaningful) sub-areas.

Example: The address must not be used as a single attribute, but must - if the underlying process requires and allows it - be divided into zip code, city, street, house number, etc.

Free of repeating groups means that attributes that same or equal -like contain information that must be stored in a different ratio.

An example of a repeating group would be a column {Telephone} that contains several telephone numbers or a column group {Telephone1, Telephone2, Telephone3}, although in the latter case it should be noted that this does not necessarily have to be a repeating group (see alternative formulations ).

Practical use

Queries of the database are made easier by the 1NF or only made possible at all if the attribute value ranges are atomic . For example, in a field that contains an entire name string consisting of title, first name, and last name, it is difficult or even impossible to sort by last name.

Alternative formulations

All attributes contain atomic content and the relation has a fixed width . This formulation refers to the fact that it should never be necessary to include further attributes in the relation because the repetition number of the repetition group is too small (e.g. with three attributes Telephone1–3 a 4th telephone number is known for a person ). It is interesting in that it can help determine whether there is actually a repeating group. B. {.., Telefon1, Telefon2, Telefon3, ..} strongly implies the existence of a repeating group, it could become clear with only other attribute names that - admittedly under the light of the application - this does not have to be the case: {.. , Telephone, fax, mobile, ..}

Another variant is created by the following addition: .. and the relation has a primary key . Although this formulation cannot be read in this way from Codd , it is an extension that leads to extremely practical data structures .

Negative example: 1NF injured

CD_ID album founding year Publishing year Track list
4711 Anastacia - Not That Kind 1999 2000 {1. Not That Kind, 2. I'm Outta Love, 3. Cowboys & Kisses}
4712 Pink Floyd - Wish You Were Here 1965 1975 {1. Shine On You Crazy Diamond}
4713 Anastacia - Freak of Nature 1999 2001 {1. Paid my Dues}
  • The Album field contains the artist and album title attribute value ranges.
  • The Title List field contains a lot of titles.

As a result, you have the following problems with queries without splitting:

  • To sort by album title, the Album field must be divided into artist and album title.
  • The titles can (with simple means) only be displayed all at the same time as a title list or not at all.


CD_ID Album title Interpreter founding year Publishing year Track title
4711 Not that child Anastacia 1999 2000 1 Not that child
4711 Not that child Anastacia 1999 2000 2 I'm Outta Love
4711 Not that child Anastacia 1999 2000 3 Cowboys & Kisses
4712 Wish You Were Here Pink Floyd 1965 1975 1 Shine On You Crazy Diamond
4713 Freak of Nature Anastacia 1999 2001 1 Paid my dues

The attribute value ranges are split into atomic attribute value ranges:

  • The Album field is split into the Album Title and Artist fields .
  • The Title List field is split into the Track and Title fields and divided into several records.

Since each attribute value range is now atomic and the table has a unique primary key (compound key from the columns CD_ID and Track ), the relation is in 1NF.

Second normal form (2NF)


A relation is in the second normal form if and only if the first normal form is present and no non-primary attribute (attribute that is not part of a key candidate) is functionally dependent on a real subset of a key candidate.

In other words, every non-primary attribute (not part of a key) is dependent on all whole keys, not just part of a key. It is important here that the non- key attributes really depend completely on all keys.

It is therefore true that relations in the 1NF whose key candidate (s) are not composed, but only consist of (one) individual attribute (s), automatically fulfill the 2NF.

In a relation R (A, B), the attribute B is functionally dependent on the attribute A, if exactly one value of the attribute B belongs to each value of the attribute A. In a relation R (S1, S2, B) the attribute B is functionally dependent on the key attributes S1 and S2, if B is functionally dependent on the composite attributes (S1, S2) but not on a single attribute S1 or S2.

This informal definition can be refined as follows:

A relation is in the second normal form if and only if it is

  1. is in the first normal form and
  2. for each attribute of the relation applies:
    • is part of a key candidate or
    • is dependent on a key candidate and is not dependent on a real subset of a key candidate.

is fully functionally dependent on each key candidate (the key candidate KC can also be formed by combining several attributes). The 2NF eliminates all partial functional dependencies, i.e. H. no non-key attribute is functionally dependent on parts of the key candidate.

If a key candidate has two attributes, a maximum of three relations can arise when breaking it down into the 2NF. If a key candidate has three attributes, the breakdown into 2NF can result in a maximum of seven relations. This is the number of subsets of a given set minus 1 (empty set) and corresponds to the number of elements of the power set ( ) as the upper limit.

Practical use

The 2NF essentially enforces “monothematic” relations in the schema: each relation only models one issue.

This reduces redundancy and the associated risk of inconsistencies. Only logically / factually related information can be found in a relation. This makes it easier to understand the data structures.

Negative example: 2NF injured

CD_ID Album title Interpreter founding year Publishing year Track title
4711 Not that child Anastacia 1999 2000 1 Not that child
4711 Not that child Anastacia 1999 2000 2 I'm Outta Love
4711 Not that child Anastacia 1999 2000 3 Cowboys & Kisses
4712 Wish You Were Here Pink Floyd 1965 1975 1 Shine On You Crazy Diamond
4713 Freak of Nature Anastacia 1999 2001 1 Paid my dues
  • The primary key of the relation is composed of the fields CD_ID and Track . (In principle, a primary key can consist of several attributes, but this creates a conflict in the example mentioned.)
  • The fields album title , artist and year of publication depend on the field CD_ID , but not on the field Track . This (point 2) violates the 2nd normal form, since the three non-primary attributes cannot only depend on part of the key (here CD_ID ). If the key were not composed (see point 1), this could not happen.

Problems that arise from it

As can be seen from the example of the CD Not That Kind, the information from these three fields is present several times, i. H. redundant. This runs the risk of violating the integrity of the data . For example, you could change the album title for the song Not That Kind to I Don't Mind without changing the corresponding entries for the songs I'm Outta Love and Cowboys & Kisses ( update anomaly ).

CD_Lied (inconsistent)
CD_ID Album title Interpreter founding year Publishing year Track title
4711 Not that child Anastacia 1999 2000 1 Not that child
4711 Not that child Anastacia 1999 2000 2 I'm Outta Love
4711 Not that child Anastacia 1999 2000 3 Cowboys & Kisses
4712 Wish You Were Here Pink Floyd 1965 1975 1 Shine On You Crazy Diamond
4713 Freak of Nature Anastacia 1999 2001 1 Paid my dues

In this case, what is known as data inconsistency would be reached . Viewed across the entire table, the data no longer “fit” together.


The data in the table is divided into two tables: CD and song . The table CD now only contains fields that functionally depend on CD_ID , so it has CD_ID as the primary key. The album title alone is also clear, i.e. a key candidate. Since there are no further (composite) key candidates, the table is automatically in the 2nd normal form. The "Song" table finally only contains fields that are functionally dependent on CD_ID and track , so it is also in the 2nd normal form. With the help of this loss-free decomposition, the aforementioned redundancies of the data are also eliminated.

CD_ID Album title Interpreter founding year Publishing year
4711 Not that child Anastacia 1999 2000
4712 Wish You Were Here Pink Floyd 1965 1975
4713 Freak of Nature Anastacia 1999 2001
CD_ID Track title
4711 1 Not that child
4711 2 I'm Outta Love
4711 3 Cowboys & Kisses
4712 1 Shine On You Crazy Diamond
4713 1 Paid my dues

The CD_ID attribute from the Lied table is called a foreign key , which refers to the primary key in the CD table . At the same time, the attributes CD_ID and Track represent the composite primary key of the Song table.

Third normal form (3NF)


The third normal form is reached exactly when the relational schema is in the 2NF and no non-key attribute (light gray cells in the table) depends transitively on a key candidate .

An attribute is transitively dependent on the key candidate if there is a set of attributes such that and .

This is a dependency in which an attribute is dependent on a key candidate of the relation via a set of attributes (without at the same time being directly dependent on, i.e. a key candidate). That is, if the set of attributes of the attribute set dependent and attribute of , then transitive depending on . Formal words .

Simply put, a non-key attribute cannot be dependent on a set of non-key attributes. A non-key attribute can therefore only be directly dependent on a primary key (or a key candidate).

See also: Transitive Relation , Synthesis Algorithm

Practical use

Transitive dependencies are immediately apparent without having to know the relationships between the data. They are represented by the structure of the relations.

In addition, any remaining thematic mix-ups in the relation are eliminated: after the 3NF, the relations of the schema are reliably monothematic.

Alternative formulation

The third normal form is reached when the relation schema is in 2NF and no non-key attribute (light gray cells in the table) is a determinant .

Or: The third normal form is reached when the relation schema is in 2NF and no non-key attribute (light gray cells in the table) is functionally dependent on another non-key attribute.

Negative example: 3NF injured

CD_ID Album title Interpreter founding year Publishing year
4711 Not that child Anastacia 1999 2000
4712 Wish You Were Here Pink Floyd 1965 1975
4713 Freak of Nature Anastacia 1999 2001
CD_ID Track title
4711 1 Not that child
4711 2 I'm Outta Love
4711 3 Cowboys & Kisses
4712 1 Shine On You Crazy Diamond
4713 1 Paid my dues

Obviously, the leaves album title of a CD from CD_ID determine the founding years of the band / artist in turn depends on performers and transitive from CD_ID from.

The problem here is again data redundancy. For example, if a new CD is introduced with an existing artist , the year of foundation is saved redundantly.


CD_ID Album title Interpreter Publishing year
4711 Not that child Anastacia 2000
4712 Wish You Were Here Pink Floyd 1975
4713 Freak of Nature Anastacia 2001
Interpreter founding year
Anastacia 1999
Pink Floyd 1965
CD_ID Track title
4711 1 Not that child
4711 2 I'm Outta Love
4711 3 Cowboys & Kisses
4712 1 Shine On You Crazy Diamond
4713 1 Paid my dues

This solution only applies if one assumes that the interpreter is unique worldwide. Otherwise you would have to add a synthetic ID in the artist table, which then places the foreign key in the CD table as follows:

CD_ID Album title Interpreter_ID Publishing year
4711 Not that child 311 2000
4712 Wish You Were Here 312 1975
4713 Freak of Nature 311 2001
Interpreter_ID Interpreter founding year
311 Anastacia 1999
312 Pink Floyd 1965
CD_ID Track title
4711 1 Not that child
4711 2 I'm Outta Love
4711 3 Cowboys & Kisses
4712 1 Shine On You Crazy Diamond
4713 1 Paid my dues

The relation is split up, with the two interdependent data being stored in a separate table. The key of the new table must be retained as a foreign key in the old table.

No changes were made to the "Song" table when it was transferred to the 3rd normal form. It is only listed here for the sake of completeness.

Boyce-Codd normal form (BCNF)


A relational scheme is in Boyce-Codd normal form if it is in the 3NF and every determinant (set of attributes on which other attributes functionally depend) is a key candidate (or the dependency is trivial).

The BCNF (after Raymond F. Boyce and Edgar F. Codd ) prevents parts of two key candidates composed of several fields from being dependent on one another.

The transfer to the BCNF is always possible without loss, but it does not always preserve dependency. The Boyce-Codd normal form was originally intended as a simplification of the 3NF, but led to a new normal form that tightened it: A relation is automatically free of transitive dependencies if all determinants are key candidates.

Negative example: BCNF injured

In this example there is a simple database in which the club membership of athletes is stored. The following conditions should apply:

  • each club only offers one sport.
  • an athlete can play in different clubs, but only if these clubs offer different sports. This ensures that the athlete never plays against a club of which he is a member.
Surname sport society
Cobbler Soccer FC Musterhausen
Leitner Soccer FC Musterhausen
Leitner ice Hockey EC example city

From the above conditions it follows that the attribute sport is functionally dependent on the attribute club (club → sport), i. H. Association is a determinant. However, the club is not a key candidate. Possible key candidates are {name, club} and {name, sport}. A conversion to BCNF is possible by using (name, club) as the primary key and dividing the relation:


Surname society
Cobbler FC Musterhausen
Leitner FC Musterhausen
Leitner EC example city
society sport
FC Musterhausen Soccer
EC example city ice Hockey

Decomposition algorithm

There is an algorithm that converts relational schemes into the Boyce-Codd normal form by decomposition. All schemes are split up until none of them breaks the BCNF. Each split occurs on the basis of a functional dependency that violates the BCNF. The attributes of the offending dependency form the first new schema, and the remaining attributes plus the determinant form another schema. The two new schemes contain only those of the original functional dependencies that only use attributes of the respective scheme, the rest is lost.

The following pseudocode describes the decomposition algorithm:

1: A relational schema is given , with the set of all attributes and the set of functional dependencies on these attributes.
2: The result set decomposition , consisting of the decomposed schemes, is initialized with .
3: As long as there is a schema in the set decomposition that is not in the BCNF, do the following decomposition:
Let be a set of attributes for which a functional dependency is defined which contradicts the BCNF.
Replaced in the result set decomposition by two new schemes , a scheme consisting of only the attributes of a function having the BCNF originally injured; and , a schema with all attributes except those that are only contained in the dependent set and not in the determinant . The set of functional dependencies now only contains those dependencies that only contain attributes from ; the same applies to . This eliminates all dependencies that require attributes from both schemes.
6: Result: decomposition - a set of relational schemas that are in the BCNF.

Run through the algorithm in the example above (without showing all trivial dependencies):

  • 1: R = ({name, sport, club}, {({name, sport} → {club}), ({club} → {sport}), ({name, club} → {name, club})} )
  • 2: decomposition  = {R}
  • 3: since R from decomposition does not meet the BCNF do the following:
    • 4,5: {club} → {sport} is the dependency that causes the violation of the BCNF, so  = ({club, sport}, {({club} → {sport})}) and  = ({name, Club}, {({Name, Club} → {Name, Club})})
  • 6: Result:

Difference to the 3NF

The BCNF normal form is stricter with regard to the permitted functional dependencies: some information can appear twice in relation schemas in 3NF, but not in the BCNF.

Fourth normal form (4NF)


A relation scheme is in the 4th normal form if it is in the BCNF and only contains trivial multi-valued dependencies (MWA).

To put it simply: Within a relation there may not be several 1: n or m: n relationships to a key value that have nothing to do with each other thematically / in terms of content. For example , if a key value i times attribute a , but regardless of this also j times attribute b , the 4NF is violated.

In clear terms: The 4NF examines n-ary relationships (more than two tables are related at the same time) and whether these were modeled correctly.

Negative example: 4NF injured

Person number domestic animal vehicle
1 cat Volkswagen
1 cat Ferrari
1 hamster Volkswagen
1 hamster Ferrari
2 dog Porsche

There are several pets and vehicles for one person number. In principle, however, a person's pets and vehicles have nothing to do with each other; they are said to be "independent of each other." Only a combination of all three attributes can be used as the primary key, so the table is in 3NF. Person number → pet is a multi-valued dependency, person number → vehicle too. Since these two MWAs are independent of each other, the 4NF is violated.

Example 4NF

The example graphic shows the incorrect modeling of the multi-valued dependencies and the correct solution. There is no relationship between pet and vehicle, so the “owns” relationship was incorrectly modeled.


Person number domestic animal
1 cat
1 hamster
2 dog
Person number vehicle
1 Volkswagen
1 Ferrari
2 Porsche


The 4NF fulfills the following relationship scheme, although there are also several MWAs:

person partner child
1 2 Gabi
1 88 Peter
1 99 Hilbert
2 1 Gabi
2 77 Hans

Person → partner and person → child are two MWAs, but these two are also interdependent: partner → child. Such interdependent MWAs are only resolved in 5NF.

Fifth normal form (5NF)


A relation is in 5NF if it is in 4NF and does not contain any multi-valued dependencies that are dependent on one another.

To put it simply: There must not be several 1: n or m: n relationships to a key value within a relation that are thematically / content-related. For example , if a key value i times attribute a , but depending on this also j times attribute b , the 5NF is violated.

The 5NF therefore requires simplified relations, from which, however, all information of the original relation must be recoverable through projection and compound operations . It is therefore very general and therefore (for the time being) the last normal form. In this way, relations can be divided into individual queries and put back together again through later compound operations, whereby a subset of the so-called Cartesian product is created.

The 5NF supports the consistency insofar as the division can also result in new combinations if, in theory, other combinations would also result when adding information, but these are not taken into account if all the attributes are in a single table of the relation.

Negative example: 5NF injured

The following relation shows which suppliers can deliver which components to which project:

supplier part Project
Müller screw Project 1
Müller nail Project 2
Maier nail Project 1

The relation has to be divided further, because it also depends on the project which parts are required for it. It is also important that the relation can be split up without losing any information.


In order to convert this relation into the 5th normal form, three relations must be created (supplier-part, sub-project and supplier-project).

  • Which parts can which supplier deliver?
Supplier part
supplier part
Müller screw
Müller nail
Maier nail
  • Which parts are required by which project?
part Project
screw Project 1
nail Project 2
nail Project 1
  • Which projects can be supplied by which supplier?
Supplier project
supplier Project
Müller Project 1
Müller Project 2
Maier Project 1


In contrast to the transformation between the previous normal forms, this transformation expresses something different through the new relations than before in the 4th normal form.

This is easy to notice when you combine the three relations from the example above:

supplier part Project
Müller screw Project 1
Müller nail Project 2
Müller nail Project 1
Maier nail Project 1

The tuple is new : Müller - Nagel - Project 1.

Because Müller could theoretically supply project 1 with nails, there

  • he also supplies project 2 with nails and
  • Project 1 also required nails (which, however, were previously supplied by Maier).

The transfer to 5NF is only possible if you want to express the possibilities of the connections from three relationships and not want a concrete connection between the three. If used correctly, this division results in new information, such as here that Müller could also supply project 1 with nails.


Weaknesses in the data model due to a lack of normalization can - in addition to the typical anomalies - mean a higher effort for later further development. On the other hand, normalization steps can be deliberately dispensed with when designing a database for reasons of performance ( denormalization ). A typical example of this is the star schema in the data warehouse .

The creation of a normalized schema is supported by automatic derivation from a conceptual data model; in practice, an extended entity relationship model (ERM) or a class diagram of the Unified Modeling Language (UML) serves as a starting point. The relation scheme derived from the conceptual draft can then be checked with the help of the normalizations; however, formalisms and algorithms exist that can already ensure this property.

Instead of the ER model originally developed by Peter Chen in 1976, extended ER models are used today: the Structured ERM (SERM), the E3R model, the EER model and the SAP SERM used by SAP AG.

If a relation scheme is not in the 1NF, this form is also called Non-First-Normal-Form (NF²) or Unnormalized Form (UNF).

The process of normalizing and decomposing a relation into the 1NF, 2NF and 3NF must preserve the recoverability of the original relation, that is, the decomposition must be true to the link and true to the dependency .

Memory rules

  1. If the relation is in the 1st normal form and the primary key consists of only one attribute and there is no other key consisting of several attributes, then the 2nd normal form is automatically available.
  2. If a relation is in 2nd normal form and if it has at most one other attribute besides the primary key that is not part of a key, then the table is in 3rd normal form.


  • Ramez Elmasri, Shamkant B. Navathe: Basics of database systems. Pearson Studies, 2002, ISBN 3-8273-7021-3
  • Alfons Kemper, Andre Eickler: Database systems. An introduction. Oldenbourg, Munich 2004, ISBN 3-486-27392-2
  • Stefan M. Lang, Peter C. Lockemann: Use of databases. Springer, Berlin a. a. 1995, ISBN 3-540-58558-3 .

Web links

Individual evidence

  1. ^ Philip M. Lewis, Arthur Bernstein, Michael Kifer: Databases and transaction processing: an application-oriented approach . Addison-Wesley, 2002, ISBN 0-201-70872-8 , pp. 232 .
This version was added to the list of articles worth reading on September 12, 2005 .