# Information content

The information content (or surprise value ) of a message is a logarithmic quantity that indicates how much information was transmitted in this message.

This term was first formalized by Claude Shannon in his information theory : the information content of a sign is its statistical significance . It describes the minimum number of bits that are required to represent or transmit a character (i.e. information). It is important that this does not necessarily correspond to the number of bits actually received (the amount of data ), since the information content depends on the semantic context.

## definition

The information content of a character x with a probability of occurrence p x is defined as

${\ displaystyle I (x) = \ log _ {a} \ left ({\ frac {1} {p_ {x}}} \ right) = \ log _ {a} (1) - \ log _ {a} (p_ {x}) = - \ log _ {a} (p_ {x})}$.

a corresponds to the thickness of the alphabet (i.e. the number of possible states of a news source).

Depending on the selected base a , the unit of the information content also changes. Shannon already stated this in A Mathematical Theory of Communication . In general, the unit of information content can be designated as Shannon (sh) , but this unit designation has not caught on. In the most common case that the binary alphabet is chosen for the alphabet (with a power of a ), the unit of the information content corresponds to the bit .

In the following text, a = 2 (the binary system ) is assumed, which gives the number of binary digits (in bits ) as the result . Any other number system could be used instead.

## General

The concept of information as used in Shannon's information theory must be strictly distinguished from the usual usage of this term. In particular, it must not be equated with the concept of meaning . In Shannon's theory, e.g. For example, two messages, one of which is of special importance while the other is just "nonsense", contains exactly the same amount of information. For the simple case in which there is only a choice between two possible messages, it is arbitrarily determined that the information associated with this situation is equal to 1. The two messages between which a decision is to be made in such a selection can be completely arbitrary. A message could e.g. B. be the text of the phone book and the other message the single letter "A". These two messages could then be coded using the symbols 0 and 1, for example.

More generally, a sequence of selection processes from a set of elementary characters is carried out by any message source, this selected sequence then representing the actual message. It is easy to see here that the probabilities of the characters in generating the message are of particular importance. This is because when the successive characters are selected, this selection is determined by this probability, at least from the point of view of the communication system. In most cases, these probabilities are even dependent on one another, i.e. that is, they depend on the previous selection events. Is z. If, for example, the last word of a word sequence is the article “die”, then the probability that an article or verb will appear again as the next word is very low.

A measure that meets the natural requirements that are placed on this information measure in a special way corresponds exactly to what has become known as entropy in statistical physics . How this information measure depends on the corresponding probabilities is explained in the following section.

The information to be transmitted is formally referred to as a character . Only a finite character set is available, but characters can be combined as required. The minimum number of bits required to represent or transmit a character now depends on the probability with which a character occurs: fewer bits are used for characters that occur frequently than for characters that are rarely used. Data compression techniques make use of this, particularly entropy codes such as arithmetic coding and Huffman coding . A similar procedure is used to balance binary trees .

The smaller the probability of occurrence of a character, the higher its information content. Conversely, the information content of a sign is very low if it occurs very often.

In principle, the information content for statistically independent events and statistically dependent events is calculated differently.

One could also say that the information content of a character is proportional to the (negative) logarithm of the probability with which one can guess it. The information content is therefore a measure of the maximum efficiency with which information can be transmitted.

An alternative measure for the information content of a character string is the Kolmogorov complexity or the algorithmic information content : it is defined as the length of the shortest program that can generate this character string. Another approach is the so-called algorithmic depth , which states how complex it is to generate a certain message. Gregory Chaitin also went beyond Shannon's definition of the entropy of information (see Algorithmic Information Theory ).

In this context, the cross entropy and the Kullback-Leibler divergence also play a role as measures for the wastage of bits caused by poor coding.

## Information content of statistically independent events

Let be a sequence of n statistically independent successive events . The information content is then the sum of the information content of all events: ${\ displaystyle x_ {1}, x_ {2}, \ dots, x_ {n}}$ ${\ displaystyle I _ {\ mathrm {ges}}}$

${\ displaystyle I _ {\ mathrm {ges}} = I (x_ {1}) + I (x_ {2}) + I (x_ {3}) + \ dots + I (x_ {n}) = \ sum _ {k = 1} ^ {n} {I (x_ {k})}}$

The information content can also be calculated using the entropy (mean information content of a character). ${\ displaystyle H (X)}$

${\ displaystyle I _ {\ mathrm {ges}} = n \ cdot H (X)}$

With an even distribution of the probabilities for all characters from the alphabet  , the total information can also be calculated using the maximum entropy or the alphabet  size: ${\ displaystyle p = p (x_ {i}) = p (x_ {i + 1})}$${\ displaystyle x_ {i}}$${\ displaystyle Z}$${\ displaystyle | Z |}$

${\ displaystyle I _ {\ mathrm {ges}} = n \ cdot H _ {\ mathrm {max}} (X) = n \ cdot \ log _ {2} ({| Z |})}$  or.  ${\ displaystyle n \ cdot I (p)}$

In the case of equal distribution, the following always applies: ${\ displaystyle p = \ textstyle {\ frac {1} {| Z |}}}$

The information content of the two sources "01010101 ..." and "10010110 ..." is the same when considering statistically independent events according to the above formula. It can be seen that the characters of the first source are ordered by a repeating structure. Therefore, one would intuitively suspect less information in the first chain than in the second chain. When viewed as a statistically independent event, each character is considered individually and the possible connection between several characters is not taken into account.

Another definition of the information of a character is the conditional entropy . It takes into account the appearance of previous characters. In this case, the successive characters are regarded as statistically dependent events.

## Information content of statistically dependent events

In the case of statistically dependent events , the context of the events is known more precisely and conclusions can be drawn from it that influence the information content. In most cases, the following events can be 'guessed' through elimination processes and ties. An example of statistically dependent events is a text in the German language: the 'c' usually occurs in pairs with an 'h' or 'k'. Other letters are also subject to such pairwise ties.

Similar to statistically independent events, the average and context-sensitive information content of a character is multiplied by the number of characters present:

${\ displaystyle I _ {\ mathrm {ges}} = n \ cdot H (X | Y)}$

The conditional entropy is calculated as follows:

${\ displaystyle H (X | Y) = \ sum _ {y} {\,} {p (y) \ cdot H (X | Y = y)} = - \ sum _ {x} ^ {} \ sum _ {y} ^ {} p (x, y) \ cdot \ log _ {2} {p (x | y)}}$

Conditional entropy as the difference between source information and transinformation :

${\ displaystyle H (X | Y) = H (X) -I (X; Y) \,}$

Interpretation: Let X and Y be two stationary dependent sources. Let H (X) be the stationary source entropy. I (X; Y) is the transinformation , the information that flows from X to Y, i.e. the amount of information from which one can infer from X to Y. If this information is high, the dependency on X and Y is also high. Accordingly, the one over X after an observation Y is not so high, since one does not get a lot of new information about Y.

Conditional entropy as total information minus the entropy of H (Y):

${\ displaystyle H (X | Y) = H (X, Y) -H (Y) = H (X, Y) - (I (X; Y) + H (Y | X)) \,}$

Interpretation: In the statistically dependent case, the common information (= I (X; Y)) of X and Y is subtracted from the total information ( composite entropy ). In addition, the new information that Y brings with it should not be included, because in the end you only want to get the amount of information from X that X alone contains. Therefore one calculates: H (X | Y) = H (X, Y) - I (X; Y) - H (Y | X)

Note: The information from statistically dependent events is always less than or equal to that from statistically independent events, since the following applies: H (X | Y) ≤ H (X)

### Join probability H (X, Y)

If there are possible events x and possible events y, the combination probability is the probability that an event occurs in pairs with an event . ${\ displaystyle n}$${\ displaystyle m}$${\ displaystyle p (y_ {j}, x_ {i})}$${\ displaystyle x_ {i}}$${\ displaystyle y_ {j}}$

The probability that the event paired with the event is occurring . ${\ displaystyle x_ {i}}$${\ displaystyle y_ {j}}$${\ displaystyle p (x_ {i}) = \ sum _ {j = 1} ^ {m} {p (x_ {i}, y_ {j})}}$

With the conditional probability, the combination probability results . ${\ displaystyle p (x_ {i}, y_ {j}) = p (x_ {i}) \ cdot p (y_ {j} | x_ {i}) = p (y_ {j}) \ cdot p (x_ {i} | y_ {j}) \,}$

The mean information content of the composite entropy per event pair of statistically dependent events is thus defined by:

${\ displaystyle H (X, Y) = - \ sum _ {i = 1} ^ {n} \ sum _ {j = 1} ^ {m} {p (x_ {i}, y_ {j})} \ cdot \ log _ {2} (p (x_ {i}, y_ {j}))}$

## Information content for analog signals

The information content of an individual value from an analog signal is basically infinite, since the probability of occurrence of a value is zero with a continuous probability distribution. For the mean information content of a real, continuous signal, the differential entropy can be calculated instead of the Shannon entropy .

Alternatively, the signal can be converted to digital with the aid of an analog- digital converter, but information is lost in the process. Since only discrete values ​​appear after the implementation, their information content can be determined again.

## Examples of statistically independent events

### example 1

A character x occurs at a source with the probability p (x) = 0.0625. For maximum efficiency for transmission in a channel, information from is necessary for each character x. ${\ displaystyle I (x) = I (0 {,} 0625) = 4 {\ text {bit}}}$

### Example 2

Given a string “Mississippi”. It consists of n = 11 characters. The alphabet with the probability of occurrence ${\ displaystyle Z = \ {i, M, p, s \}}$

${\ displaystyle p ({\ mathit {i}}) = {\ frac {4} {11}}, \; p ({\ mathit {M}}) = {\ frac {1} {11}}, \ ; p ({\ mathit {p}}) = {\ frac {2} {11}}, \; p ({\ mathit {s}}) = {\ frac {4} {11}}}$

The total information is:

{\ displaystyle {\ begin {aligned} I _ {\ text {ges}} & = \ sum _ {i = 1} ^ {4} n_ {i} \ cdot I ({z_ {i}}) \\ & = 4 \ cdot I (i) +1 \ cdot I (M) +2 \ cdot I (p) +4 \ cdot I (s) \\ & = 4 \ cdot 1 {,} 46 \, \ mathrm {bit} +1 \ cdot 3 {,} 46 \, \ mathrm {bit} +2 \ cdot 2 {,} 46 \, \ mathrm {bit} +4 \ cdot 1 {,} 46 \, \ mathrm {bit} \\ & = 5 {,} 84 \, \ mathrm {bit} +3 {,} 46 \, \ mathrm {bit} +4 {,} 92 \, \ mathrm {bit} +5 {,} 84 \, \ mathrm {bit} \\ & = 20 {,} 06 \, \ mathrm {bit} \ end {aligned}}}

This results in the total number of 21 bits that is necessary to optimally encode the individual letters of the word "Mississippi" in binary.

### Example 3

Alphabet Z = {a, b} with p (a) = 0.01 and p (b) = 0.99. The string consists of 100 characters.

• ${\ displaystyle I (a) = 6 {,} 6439 {\ text {bit}}}$ (rare occurrence ⇒ high level of information in case of occurrence)
• ${\ displaystyle I (b) = 0 {,} 0145 {\ text {bit}}}$ (frequent occurrence ⇒ little information in case of occurrence)

Overall information: ${\ displaystyle I _ {\ text {ges}} = 1 \ cdot I (a) +99 \ cdot I (b) \ approx 8 {,} 08 {\ text {bit}}}$

This results in total information of 9 bits.