Conditional entropy


In information theory, conditional entropy is a measure of the "uncertainty" about the value of a random variable $X$ that remains after the outcome of another random variable $Y$ becomes known. The conditional entropy is written $H(X \mid Y)$ and takes values between 0 and $H(X)$, the original entropy of $X$. It is measured in the same unit as entropy.

Specifically, it equals 0 if the value of $X$ can be determined functionally from $Y$, and it equals $H(X)$ if $X$ and $Y$ are stochastically independent.

Definition

Let $X$ be a discrete random variable with range $\mathcal{X}$, that is, there is an at most countable set $\mathcal{X}$ with $P(X \in \mathcal{X}) = 1$. Then the entropy of $X$ is defined by

$$H(X) = -\sum_{x \in \mathcal{X}} P(X = x) \log P(X = x),$$

where the base of the logarithm is typically chosen as 2 (bit) or $e$ (nat), which determines the corresponding unit. If $P(X = x) = 0$ for some $x$, one sets $0 \cdot \log 0 = 0$ by convention, so the corresponding term is not included in the sum.

Let $A$ be an event with $P(A) > 0$. The conditional entropy of $X$ given $A$ is defined by replacing the probability $P(X = x)$ with the conditional probability $P(X = x \mid A)$, i.e.

$$H(X \mid A) = -\sum_{x \in \mathcal{X}} P(X = x \mid A) \log P(X = x \mid A).$$

Now let $Y$ be a discrete random variable with range $\mathcal{Y}$. Then the conditional entropy of $X$ given $Y$ is defined as the weighted mean of the conditional entropies of $X$ given the events $\{Y = y\}$ for $y \in \mathcal{Y}$, i.e.

$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} P(Y = y)\, H(X \mid Y = y).$$

At a higher level of abstraction, $H(X \mid Y = y)$ is the conditional expected value of the information function of $X$ given the event $\{Y = y\}$, and $H(X \mid Y)$ is the expected value of the function $y \mapsto H(X \mid Y = y)$.
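As a concrete illustration of the definition above, the following Python sketch computes $H(X \mid Y)$ as the weighted mean of the entropies $H(X \mid Y = y)$, starting from a joint distribution given as a dictionary. The function names and the data layout are chosen freely for this example and are not part of the article.

```python
import math

def entropy(pmf, base=2):
    """Entropy of a distribution given as {value: probability}.
    Terms with probability 0 are skipped, following the 0*log(0) = 0 convention."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0)

def conditional_entropy(joint, base=2):
    """H(X|Y) from a joint distribution {(x, y): P(X=x, Y=y)},
    computed as sum_y P(Y=y) * H(X | Y=y)."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    h = 0.0
    for y, py in p_y.items():
        if py > 0:
            cond = {x: p / py for (x, y2), p in joint.items() if y2 == y}
            h += py * entropy(cond, base)
    return h

# Small usage example with an arbitrary joint distribution of two binary variables:
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.0}
print(conditional_entropy(joint))   # ~0.689 bit
```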

Properties

Figure: A memoryless channel connects two sources. The transinformation $I(X;Y)$ is the information that was sent by $X$ and also received by $Y$.

A simple calculation shows that

$$H(X \mid Y) = H(X, Y) - H(Y),$$

so the uncertainty of $X$ given $Y$ is equal to the joint uncertainty of $X$ and $Y$ minus the uncertainty of $Y$.

Moreover, $H(X \mid Y) \le H(X)$, with equality if and only if $X$ and $Y$ are stochastically independent. This follows from the fact that $H(X, Y) = H(X) + H(Y)$ if and only if $X$ and $Y$ are stochastically independent. For the channel in the figure, independence also means that $H(Y \mid X) = H(Y)$, so the entire information received is just misinformation; similarly, the complete information from source $X$ is lost, so that no transinformation is then available.

It also holds that

$$H(X \mid Y) \ge 0,$$

with equality if and only if $X$ depends functionally on $Y$, i.e. $X = f(Y)$ for some function $f$.
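The chain rule and the bound above can be checked numerically. The following short Python sketch does this for an arbitrarily chosen joint distribution of two binary variables; the helper name H and the concrete numbers are only for illustration.

```python
import math

def H(ps):
    """Shannon entropy in bit of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Arbitrary joint distribution P(X=x, Y=y) of two binary variables.
joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/4, (1, 1): 0}

h_xy = H(joint.values())                      # H(X,Y) = 1.5 bit
h_y  = H([3/4, 1/4])                          # H(Y), marginal of the joint above
h_x  = H([3/4, 1/4])                          # H(X), marginal of the joint above
h_x_given_y = 3/4 * H([2/3, 1/3]) + 1/4 * 0   # sum_y P(Y=y) * H(X | Y=y)

assert abs(h_x_given_y - (h_xy - h_y)) < 1e-12   # H(X|Y) = H(X,Y) - H(Y)
assert h_x_given_y <= h_x                        # H(X|Y) <= H(X)
```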

Block entropy

Transferred to a multivariate random variable $(X_1, \ldots, X_n)$ of length $n$, representing a block of $n$ symbols, the conditional entropy can be defined as the uncertainty of the next symbol given a particular preceding block of $n$ symbols:

$$H(X_{n+1} \mid X_1, \ldots, X_n) = H(X_1, \ldots, X_{n+1}) - H(X_1, \ldots, X_n) \quad \text{with} \quad H_n := H(X_1, \ldots, X_n),$$

where $H_n$ denotes the block entropy. For the conditional entropy $h_n$, i.e. the uncertainty of a symbol after an $n$-block, it therefore follows that

$$h_n = H_{n+1} - H_n.$$

In the limit $n \to \infty$, the block entropy per symbol $H_n / n$ and the conditional entropy $h_n$ converge to the same value; cf. source entropy.
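For a stationary source, $h_n = H_{n+1} - H_n$ can be estimated directly from a sufficiently long sample sequence. The following Python sketch does this with simple n-gram counts; the cyclic counting and the sample sequence are assumptions made only for this illustration.

```python
import math
from collections import Counter

def block_entropy(seq, n, base=2):
    """Empirical block entropy H_n from the relative frequencies of n-blocks.
    The sequence is treated cyclically so every position contributes one block."""
    if n == 0:
        return 0.0
    ext = seq + seq[:n - 1]                       # wrap around the end
    counts = Counter(ext[i:i + n] for i in range(len(seq)))
    return -sum(c / len(seq) * math.log(c / len(seq), base) for c in counts.values())

def conditional_block_entropy(seq, n, base=2):
    """h_n = H_{n+1} - H_n: uncertainty of one symbol after an n-block."""
    return block_entropy(seq, n + 1, base) - block_entropy(seq, n, base)

# Example: a periodic sample sequence with period four.
sample = "0010" * 25
print([round(conditional_block_entropy(sample, n), 3) for n in range(4)])
# -> [0.811, 0.689, 0.5, 0.0]
```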

Conditional entropy is also closely related to the transinformation (mutual information), which indicates the strength of the statistical relationship between two random variables.

Example

Let X be a source that periodically sends the characters ... 00100010001000100010 ...

Now the conditional entropy of the currently observed character is to be calculated taking into account previous characters.

No characters considered

The calculation is based on the definition of entropy .

Since exactly one in four characters is a 1, the following probability table results:

            x = 0    x = 1
P(X = x)    3/4      1/4

This gives

$$H(X) = -\tfrac{3}{4} \log_2 \tfrac{3}{4} - \tfrac{1}{4} \log_2 \tfrac{1}{4} \approx 0.811 \text{ bit}.$$

One character considered

Now let $X := x_t$ and $Y := x_{t-1}$. The following probabilities result:

Probability table:

P(X = x | Y = y)    x = 0    x = 1
y = 0               2/3      1/3
y = 1               1        0

The marginal probabilities of the preceding character are:

            y = 0    y = 1
P(Y = y)    3/4      1/4

The following applies:

$$H(X \mid Y) = \sum_y P(Y = y)\, H(X \mid Y = y) = \tfrac{3}{4} \cdot 0.918 \text{ bit} + \tfrac{1}{4} \cdot 0 \approx 0.689 \text{ bit}.$$

Two characters considered

Let $X := x_t$ and $Y := (x_{t-2}, x_{t-1})$. The following probabilities result:

The block $Y = (1,1)$ never appears in the source, so it does not need to be considered.

Probability table:

P(X = x | Y = y)    x = 0    x = 1
y = (0,0)           1/2      1/2
y = (0,1)           1        0
y = (1,0)           1        0
y = (1,1)           -        -

The marginal probabilities of the preceding block are:

            y = (0,0)    y = (0,1)    y = (1,0)    y = (1,1)
P(Y = y)    1/2          1/4          1/4          0

The following applies:

$$H(X \mid Y) = \tfrac{1}{2} \cdot 1 \text{ bit} + \tfrac{1}{4} \cdot 0 + \tfrac{1}{4} \cdot 0 = 0.5 \text{ bit}.$$

Three characters considered

If three consecutive characters are already known, the next character is determined as well, because the source is periodic with period four. There is thus no new information in the next character, and the conditional entropy must accordingly be zero. This can also be seen from the probability table:

P(X = x | Y = y)    x = 0    x = 1
y = (0,0,0)         0        1
y = (0,0,1)         1        0
y = (0,1,0)         1        0
y = (0,1,1)         -        -
y = (1,0,0)         1        0
y = (1,0,1)         -        -
y = (1,1,0)         -        -
y = (1,1,1)         -        -

The following applies:

$$H(X \mid Y) = \sum_y P(Y = y)\, H(X \mid Y = y) = 0 \text{ bit},$$

since $H(X \mid Y = y) = 0$ for every block $y$ that actually occurs.

Impossible events are marked here with "-", e.g. at y = (1,0,1). The given source will never deliver this output, since a one is always followed by three zeros.

One can see that the table contains no probabilities other than 0 or 1. Since such terms contribute nothing to the entropy sum ($1 \cdot \log 1 = 0$, and $0 \cdot \log 0 = 0$ by convention), the conditional entropy must ultimately be zero.
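The determinism of the next character can also be checked mechanically. The following short Python sketch (with a freely chosen sample length) lists, for every 3-block that occurs in the periodic sequence, the set of characters that follow it:

```python
from collections import defaultdict

# For every 3-block occurring in the periodic source, collect the possible successors.
sample = "0010" * 25
followers = defaultdict(set)
for i in range(3, len(sample)):
    followers[sample[i - 3:i]].add(sample[i])

print(dict(followers))
# {'001': {'0'}, '010': {'0'}, '100': {'0'}, '000': {'1'}}
# Each occurring 3-block has exactly one possible successor, hence H(X | Y) = 0.
assert all(len(s) == 1 for s in followers.values())
```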

Explanation of the probability tables

The tables refer to the example character sequence above.

P(X = x | Y = y)    x = 0    x = 1
y = 0               2/3      1/3
y = 1               1        0

The following applies:

$$P(X = x \mid Y = y) = \frac{P(X = x,\, Y = y)}{P(Y = y)}.$$

Here one considers a character $x_t$ conditioned on the previous character $x_{t-1}$. For example, if the previous character is $y = 1$, the question is: with what probability does the character $x = 0$ or $x = 1$ follow? In the example sequence, a 1 is always followed by a 0, so $P(X = 0 \mid Y = 1) = 1$. It also follows that $P(X = 1 \mid Y = 1) = 0$, because each row sum is always one.

P(x_{t-1}, x_t)    x_t = 0    x_t = 1
x_{t-1} = 0        1/2        1/4
x_{t-1} = 1        1/4        0

The following applies:

$$P(x_{t-1} = y,\, x_t = x) = P(X = x \mid Y = y) \cdot P(Y = y).$$

Here one considers the frequency of occurrence of each character combination, i.e. of each pair of adjacent characters. From the table one can read that the combinations (0,1) and (1,0) occur equally often. The sum of all matrix entries is one.
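These two tables can be reproduced from the character sequence itself. The following Python sketch counts adjacent character pairs in a sample of the periodic source (the sample length is chosen arbitrarily) and derives the joint and conditional probabilities from the counts:

```python
from collections import Counter

sample = "0010" * 25                     # sample of the periodic source

# Count adjacent character pairs (x_{t-1}, x_t).
pairs = Counter((sample[i - 1], sample[i]) for i in range(1, len(sample)))
total = sum(pairs.values())
prev_counts = Counter(sample[:-1])       # occurrences of the preceding character

joint = {yx: c / total for yx, c in pairs.items()}              # ~ P(x_{t-1}, x_t)
cond = {yx: c / prev_counts[yx[0]] for yx, c in pairs.items()}  # ~ P(x_t | x_{t-1})

print(joint)   # ~ {('0','0'): 1/2, ('0','1'): 1/4, ('1','0'): 1/4}
print(cond)    # ~ {('0','0'): 2/3, ('0','1'): 1/3, ('1','0'): 1}
```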

Entropy and information content

In this example, the conditional entropy decreases as more preceding characters are taken into account (see also: Markov process). If the number of characters taken into account is chosen sufficiently large, the conditional entropy converges to zero.

If one wants to calculate the information content of the given string of $n = 12$ characters, then according to the definition $I_{\text{ges}} = n \cdot H(X \mid Y)$ one obtains (see also the short sketch after this list), with ...

... no characters considered: approximately 9.74 bit of total information (information content of statistically independent events);
... one character considered: approximately 8.26 bit of total information;
... two characters considered: 6 bit of total information;
... three characters considered: 0 bit of total information.
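The totals above follow from the conditional entropies computed in the example. As a rough cross-check, the following Python sketch recomputes them from the exact probabilities (the helper name H2 is chosen only for this illustration):

```python
import math

def H2(p):
    """Binary entropy in bit of a distribution (p, 1 - p)."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

n = 12
h = [
    H2(1/4),          # no characters considered:  H(X)                ~ 0.811 bit
    3/4 * H2(1/3),    # one character considered:  3/4 * H(2/3, 1/3)   ~ 0.689 bit
    1/2 * H2(1/2),    # two characters considered: 1/2 * 1 bit         =  0.5  bit
    0.0,              # three characters: the next character is determined
]
for k, hk in enumerate(h):
    print(f"{k} characters considered: I_ges = {n * hk:.2f} bit")
# -> 9.74, 8.26, 6.00 and 0.00 bit
```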

