Redundancy (information theory)

The concept of redundancy (from the Latin redundare, overflowing, abundantly pouring out) describes in information theory that information that is present several times in one information source. An information unit is redundant if it can be omitted without loss of information. Identifying and removing such redundancies is called deduplication .

Transmission of messages and information

Redundant is the part of a message that does not contain any information . The redundant part of the message can be a function of the information contained in the message. In information technology and communications technology applications, redundancy is used specifically to detect errors. A stronger redundancy enables not only the detection of errors but also their correction. Redundancy thus allows an increase in quality (fewer errors) at the expense of quantity (higher data rate). The strength of the redundancy to be used in each case depends on the fault tolerance of the respective application - in banking and space travel a single overturned bit could cost a lot of money, while in Internet telephony or DVB even the constant loss of entire packets is irrelevant.

Fault tolerance

A communication can be carried out error-tolerant through redundant information via an information channel , since under certain circumstances lost or falsified partial information can be reconstructed by the recipient from its context . The Hamming distance is a measure of the error tolerance .

Average code word length

Let Z be an alphabet and z ∈ Z

C (z) denotes the code word belonging to z

l (z) denotes the length of C (z)

The mean code word length L (C) of a source code C (z) with the probability distribution p (z) is given by:

{\ displaystyle L (C) = \ sum _ {i = 1} ^ {| Z |} {l (z_ {i}) p (z_ {i})}}

Redundancy of a code

The redundancy of the code is the difference between the mean code word length and entropy . (Example: Huffman coding for optimal (= minimal) ). ${\ displaystyle L (C)}$ ${\ displaystyle H (X)}$ ${\ displaystyle L (C)}$

{\ displaystyle R _ {\ text {Code}} = L (C) -H (X) \,}

The redundancy of the source is the difference between maximum entropy and the entropy of the message source. ${\ displaystyle H _ {\ text {max}} (X) = \ log _ {2} | Z |}$ ${\ displaystyle H (X)}$

{\ displaystyle R _ {\ text {source}} = \ log _ {2} | Z | -H (X) \,}

Since the code word length cannot be less than the entropy, the redundancy is never negative.

Coding

Coding theory distinguishes between two manifestations of redundancy:

The distribution redundancy lies in the different probability of occurrence of the individual characters of the alphabet.
The tie redundancy is that after certain characters, the occurrence of a certain other character is particularly likely. For example, in a German text, a q is almost always followed by a u.

Databases and data structures

In database development and in the data structures of programs, it is important to avoid redundancies as completely as possible, as these can lead to higher memory requirements and inconsistencies . Redundancies are therefore counted among the anomalies . Freedom from redundancy is a basic principle for a logical data model.

Redundancies can be largely avoided by normalizing the database schema. There are also redundancies that are unavoidable (for example key redundancies ) and are therefore accepted as a necessary evil . There are also redundancies that are accepted because avoiding them would represent too much effort in relation to their problem, such as multiple occurrences of an attribute value or the double storage of the name Müller for Mr. Müller and for Ms. Müller.

The deliberate acceptance of redundancy to achieve better reading performance is called denormalization .

disadvantage

Redundancies in the data structures of programs and databases can lead to program errors. The programmer must ensure that he also keeps the redundant data consistent with all changes . This requires a high level of synchronization effort. The larger the project and the longer the project is being developed, the more difficult this is. When multiple programmers unknowingly work independently on redundant data, it is almost impossible to keep the changes consistent.

advantages

There are some cases in which deliberately created data redundancy reduces the computing time of the software. It can be achieved through targeted denormalization . However, this precisely calculated and desired redundancy must be clearly distinguished from redundancy that has arisen negligently because someone does not apply the normalization rules. Denormalizations usually improve read performance, but degrade write performance.

literature

F. Topsoe: Information Theory. An introduction, BG Teubner Verlag, Stuttgart 1974, ISBN 978-3-519-02048-6 .
Otto Mildenberger: Information Theory and Coding. 2nd Edition. Friedrich Vieweg & Sohn Verlagsgesellschaft, Wiesbaden 1992, ISBN 3-528-13046-6 .
Werner Meyer-Eppler: Fundamentals and applications of information theory. 2nd edition, Springer Verlag, Berlin / Heidelberg 1969, ISBN 978-3-642-49130-6 .
Martin Bossert: Introduction to communications technology. Oldenbourg, Munich 2012, ISBN 978-3-486-70880-6 .
Ernst Schultze: Introduction to the mathematical foundations of information theory. Springer Verlag, Berlin / Heidelberg 1969, ISBN 978-3-540-04633-2 .
Martin Werner: communications engineering. An introduction to all courses, 7th edition, Vieweg + Teubner Verlag, Wiesbaden 2010, ISBN 978-3-8348-0905-6 .

Web links

Basics of Information Theory (accessed February 23, 2018)
Coding Theory (accessed February 23, 2018)