Kullback-Leibler divergence

From Wikipedia, the free encyclopedia

The terms Kullback-Leibler divergence (KL divergence for short), Kullback-Leibler entropy, Kullback-Leibler information, Kullback-Leibler distance (named after Solomon Kullback and Richard Leibler), and information gain denote a measure of the difference between two probability distributions. Typically, one of the distributions represents empirical observations or the exact probability distribution, while the other represents a model or an approximation.

Note: The KL divergence is also called relative entropy; the term relative entropy is occasionally also used for the mutual information (transinformation).

Formally, for two probability functions P and Q over discrete values, the KL divergence can be determined as follows:

D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
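As a minimal illustration of the discrete definition above, the following Python sketch sums the terms directly; the two distributions p and q are made up for demonstration and are not taken from the article.

```python
import math

def kl_divergence(p, q, base=2.0):
    """Discrete KL divergence D_KL(P || Q) by direct summation.

    p, q: sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute 0; a positive p_i paired with
    q_i == 0 makes the divergence infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # 0 * log(0/q) is defined as 0
        if q_i == 0.0:
            return math.inf   # P puts mass where Q puts none
        total += p_i * math.log(p_i / q_i, base)
    return total

# Two made-up distributions over three outcomes (illustrative only).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # result in bits (base 2)
```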

If the distributions P and Q for continuous values are represented by the probability density functions p and q, an integral is calculated instead:

D_{\mathrm{KL}}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \, \log \frac{p(x)}{q(x)} \, \mathrm{d}x
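For the continuous case, a hedged sketch: the integral is approximated numerically for two made-up univariate normal densities and compared with the known closed form for this special case (both in nats); the parameters are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Parameters of two illustrative Gaussian densities p and q (made up).
mu0, sigma0 = 0.0, 1.0
mu1, sigma1 = 1.0, 2.0

def integrand(x):
    # p(x) * log(p(x) / q(x)), written with log-densities for numerical stability
    return norm.pdf(x, mu0, sigma0) * (norm.logpdf(x, mu0, sigma0)
                                       - norm.logpdf(x, mu1, sigma1))

# Numerical value of the integral, in nats; the tails beyond +/-15 are negligible here.
kl_numeric, _ = quad(integrand, -15.0, 15.0)

# Known closed form for two univariate normal distributions, for comparison.
kl_exact = (np.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2)
            - 0.5)

print(kl_numeric, kl_exact)   # the two values agree closely
```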

From the viewpoint of information theory, the Kullback-Leibler divergence measures how much space is wasted per character, on average, when a coding based on Q is applied to an information source that actually follows the distribution P. There is thus a connection to the channel capacity. Mathematically, this is consistent with the statement that D_{\mathrm{KL}}(P \parallel Q) \geq 0, with equality if and only if P and Q are identical.
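A small sketch of this coding interpretation, with made-up distributions: the cross entropy H(P, Q) is the average code length when the code is designed for Q but the source follows P, and its excess over the entropy H(P) is exactly the KL divergence, i.e. the wasted bits per symbol.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Cross entropy H(P, Q) in bits: expected code length under P
    when an (idealized) code optimal for Q is used."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # actual source distribution (made up)
q = [0.25, 0.25, 0.5]   # distribution the code was designed for (made up)

h_p  = entropy_bits(p)             # optimal average code length
h_pq = cross_entropy_bits(p, q)    # average length with the mismatched code
print(h_pq - h_p)                  # wasted bits per symbol = D_KL(P || Q) >= 0
```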

The choice of the base of the logarithm in the calculation determines the information unit of the result. In practice, the KL divergence is often given in bits (shannons), for which base 2 is used; the nat (base e) and the ban (base 10) are used more rarely.
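As a brief illustration of the units, the same made-up pair of distributions evaluated with the three bases yields values that differ only by the constant factor of the base change.

```python
import math

p = [0.5, 0.3, 0.2]   # made-up distributions
q = [0.4, 0.4, 0.2]

def kl(p, q, base):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

kl_bits = kl(p, q, 2)        # shannons / bits
kl_nats = kl(p, q, math.e)   # nats
kl_bans = kl(p, q, 10)       # bans

# The values differ only by a constant factor, e.g. kl_bits == kl_nats / ln(2).
print(kl_bits, kl_nats / math.log(2), kl_bans)
```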

Instead of the Kullback-Leibler divergence, the cross entropy is often used. It provides qualitatively comparable values but can be estimated without precise knowledge of P. This is advantageous in practical applications, because the actual underlying distribution P of the observed data is usually unknown.
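A hedged Monte Carlo sketch of this point: given samples drawn from the (unknown) source distribution P, the cross entropy H(P, Q) = -E_P[log q(X)] can be estimated as the average of -log q(x_i); no expression for p(x) appears in the estimate. The Gaussian model q and the sampling parameters below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Samples from the source distribution P (unknown to the estimator).
samples = rng.normal(loc=0.3, scale=1.2, size=10_000)

# A candidate model Q; only its density q(x) is needed.
mu_q, sigma_q = 0.0, 1.0

# Monte Carlo estimate of the cross entropy, in nats.
cross_entropy_estimate = -np.mean(norm.logpdf(samples, mu_q, sigma_q))
print(cross_entropy_estimate)
```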

Although the Kullback-Leibler divergence is sometimes also referred to as the Kullback-Leibler distance, it does not meet a fundamental requirement for distance measures: it is not symmetric, i.e. in general

D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P).

Alternatively, to establish symmetry, the sum of the two divergences can be used, which is obviously symmetric:

D_{\mathrm{sym}}(P, Q) = D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P).
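A short numeric sketch of the asymmetry, with made-up distributions: the two orderings generally give different values, while their sum does not depend on the order.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats, assuming q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # made-up distributions
q = [0.3, 0.4, 0.3]

print(kl(p, q), kl(q, p))    # generally two different values
print(kl(p, q) + kl(q, p))   # symmetrized divergence, independent of the order
```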

Multivariate normal distributions

For two multivariate normal distributions \mathcal{N}_0 = \mathcal{N}(\mu_0, \Sigma_0) and \mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1) of dimension k, with mean vectors \mu_0, \mu_1 and (non-singular) covariance matrices \Sigma_0, \Sigma_1, the Kullback-Leibler divergence is given by:

D_{\mathrm{KL}}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2} \left( \operatorname{tr}\!\left(\Sigma_1^{-1} \Sigma_0\right) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \ln \frac{\det \Sigma_1}{\det \Sigma_0} \right)
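A straightforward sketch of this closed form in Python, evaluated on two made-up two-dimensional normal distributions; the parameter values are arbitrary illustrations.

```python
import numpy as np

def kl_mvn(mu0, sigma0, mu1, sigma1):
    """D_KL(N0 || N1) between two multivariate normal distributions,
    evaluated with the closed form given above (result in nats)."""
    k = mu0.shape[0]
    sigma1_inv = np.linalg.inv(sigma1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    return 0.5 * (np.trace(sigma1_inv @ sigma0)
                  + diff @ sigma1_inv @ diff
                  - k
                  + logdet1 - logdet0)

# Two made-up 2-dimensional normal distributions.
mu0 = np.array([0.0, 0.0])
sigma0 = np.array([[1.0, 0.2],
                   [0.2, 1.0]])
mu1 = np.array([1.0, -1.0])
sigma1 = np.array([[1.5, 0.0],
                   [0.0, 0.5]])

print(kl_mvn(mu0, sigma0, mu1, sigma1))   # non-negative, 0 only for identical parameters
```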

Example

An example of the application of the Kullback-Leibler distance for assessing the similarity of two probability distributions is the comparison of two sets of grooves on a honed surface. First, a parametric model is set up for the depth, spacing and width of the grooves, and its parameters are estimated for each of the two samples to be compared. The Kullback-Leibler distance between the two fitted distributions can then be calculated. Its magnitude indicates how similar the distributions of the two groove sets are; if it is 0, both samples come from the same distribution. A hypothetical sketch of this procedure is shown below.
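The following is only a rough, hypothetical sketch of the idea, not the method of the cited paper: a normal model is assumed for a single groove parameter (here the depth), its parameters are estimated for two simulated samples, and the fitted distributions are compared with the closed-form KL divergence for univariate normals. All data and parameter choices are invented for illustration.

```python
import numpy as np

def kl_normal(mu0, sigma0, mu1, sigma1):
    """Closed-form D_KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats."""
    return (np.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2)
            - 0.5)

rng = np.random.default_rng(1)

# Hypothetical measured groove depths (in micrometres) from two honed surfaces.
depths_a = rng.normal(4.0, 0.5, size=200)
depths_b = rng.normal(4.2, 0.6, size=200)

# Estimate the parameters of the assumed normal model for each sample ...
mu_a, sigma_a = depths_a.mean(), depths_a.std(ddof=1)
mu_b, sigma_b = depths_b.mean(), depths_b.std(ddof=1)

# ... and compare the fitted distributions; values near 0 indicate similar groove sets.
print(kl_normal(mu_a, sigma_a, mu_b, sigma_b))
```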

References

  1. Duchi, J.: Derivations for Linear Algebra and Optimization, p. 13.
  2. Doris Krahe, Jürgen Beyerer: Parametric method to quantify the balance of groove sets of honed cylinder bores. In: Architectures, Networks, and Intelligent Systems for Manufacturing Integration. doi:10.1117/12.294431 (spie.org [accessed August 2, 2016]).