Sanov's theorem is a result from probability theory, a subfield of stochastics. It is a central statement of the theory of large deviations and has a close connection to information theory. The theorem formalizes the intuition that the total probability of a rare event is dominated by the probability of its most plausible sub-event. It is named after the Russian mathematician Ivan Nikolayevich Sanov (1919–1968).
By the strong law of large numbers, the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i of a sequence of i.i.d. random variables X_1, X_2, … converges almost surely to the expected value μ = E[X_1]. However, this makes no statement about the speed of convergence. Typically the mean will be close to μ, but it cannot be ruled out that it still deviates significantly from the limit value even for an arbitrarily large n. Sanov's theorem quantifies how quickly the probability of such a deviation decreases as n grows. In addition to the asymptotic behavior, one can also ask how likely a given deviation of the mean is for a specific n. For example, in his famous work The Doctrine of Chances, Abraham de Moivre considered a thought experiment about coin tossing: what is the probability that the average of n fair coin tosses deviates substantially from 1/2?
Formally, let X_1, X_2, … be fair coin tosses with values in Σ = {0, 1} and common distribution μ, and let

L_n = (1/n) ∑_{i=1}^n δ_{X_i}

be the empirical distribution of the first n coin tosses, where δ_x denotes the Dirac measure at the position x. Then L_n always lies in the set M_1(Σ) of probability measures on Σ, and L_n converges to μ according to the law of large numbers. In addition, let Γ ⊆ M_1(Σ) be the subset of all distributions whose expected value is at least some threshold a > 1/2. Then the probability that the random measure L_n lies in Γ is exactly the probability P(X̄_n ≥ a).
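This construction can be illustrated with a short simulation. The following sketch (assuming a fair coin encoded as values 0 and 1; all names are illustrative) computes the empirical distribution L_n as the vector of relative frequencies:

```python
import random
from collections import Counter

def empirical_distribution(samples):
    """Empirical measure L_n: the relative frequency of each observed value."""
    n = len(samples)
    return {x: count / n for x, count in Counter(samples).items()}

random.seed(0)
tosses = [random.randint(0, 1) for _ in range(10_000)]  # fair coin, values in {0, 1}
L_n = empirical_distribution(tosses)
print(L_n)  # both weights are close to 1/2, as the law of large numbers predicts
```

The weights of L_n sum to 1 by construction, so L_n is itself a probability measure on {0, 1}.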
Under these assumptions, Sanov's theorem says that

−inf_{ν ∈ int Γ} D(ν‖μ) ≤ liminf_{n→∞} (1/n) log P(L_n ∈ Γ) ≤ limsup_{n→∞} (1/n) log P(L_n ∈ Γ) ≤ −inf_{ν ∈ cl Γ} D(ν‖μ)

holds for every set Γ ⊆ M_1(Σ). Here int Γ denotes the interior and cl Γ the closure of Γ, and D(·‖·) is the Kullback–Leibler divergence. If, in addition, the left- and right-hand sides of the chain of inequalities coincide, then the limit exists and it holds that

lim_{n→∞} (1/n) log P(L_n ∈ Γ) = −inf_{ν ∈ Γ} D(ν‖μ).
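The exponential decay rate can be checked numerically. The following sketch is an illustration, not part of the theorem: the threshold 3/4 and all names are chosen for the example. It compares the exact probability that the mean of n fair coin tosses is at least 3/4 with the rate inf_{ν∈Γ} D(ν‖μ) = D(Bern(3/4)‖Bern(1/2)) predicted by Sanov's theorem, using natural logarithms (nats):

```python
import math

def kl_bernoulli(q, p):
    """Kullback-Leibler divergence D(Bern(q) || Bern(p)) in nats, for 0 < q, p < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

# The infimum of D(. || Bern(1/2)) over {distributions with mean >= 3/4}
# is attained at Bern(3/4), since the divergence grows as q moves away from 1/2.
rate_predicted = kl_bernoulli(0.75, 0.5)

for n in (100, 1000):
    k0 = math.ceil(0.75 * n)
    # Exact tail probability P(mean of n fair tosses >= 3/4) via the binomial distribution.
    prob = sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n
    rate_observed = -math.log(prob) / n
    print(n, round(rate_observed, 4), round(rate_predicted, 4))
```

The observed rate approaches the predicted one from above; the gap of order (log n)/n corresponds to the subexponential correction factors discussed in the remarks below.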
Remarks
The specific choice of the base of the logarithm is irrelevant, but it must be the same base that is used for the divergence (cf. Shannon (unit)).
From the finiteness of Σ and the continuity of ν ↦ D(ν‖μ) it follows that the infimum over the interior of Γ can indeed be strictly greater than the infimum over the closure.
If Γ is convex, then the minimizing measure ν* = argmin_{ν ∈ Γ} D(ν‖μ) is well defined; it is called the information projection of μ onto Γ.
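As a numerical sketch (the Bernoulli family and the constraint q ≥ 3/4 are assumed purely for illustration), the information projection of a fair coin onto the convex set of coins with bias at least 3/4 can be found by direct minimization of the divergence:

```python
import math

def kl_bernoulli(q, p):
    """D(Bern(q) || Bern(p)) in nats, with the convention 0 * log 0 = 0."""
    result = 0.0
    if q > 0:
        result += q * math.log(q / p)
    if q < 1:
        result += (1 - q) * math.log((1 - q) / (1 - p))
    return result

# Gamma = {Bern(q) : q >= 3/4} is convex; the true distribution is mu = Bern(1/2).
# Grid search for the minimizer of D(. || mu) over Gamma.
grid = [0.75 + i * 0.25 / 10_000 for i in range(10_001)]
projection = min(grid, key=lambda q: kl_bernoulli(q, 0.5))
print(projection)  # the boundary point q = 3/4 minimizes the divergence
```

Since D(Bern(q)‖Bern(1/2)) is increasing in q on [1/2, 1], the minimizer sits on the boundary of Γ, which matches the general picture for the projection of μ onto a convex set not containing μ.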
Up to sublinear additive terms in the exponent (i.e., subexponential factors), the asymptotic relation

P(L_n ∈ Γ) ≈ exp(−n · inf_{ν ∈ Γ} D(ν‖μ))

holds if the divergence is given in nats. One can even show that for every n,

P(L_n ∈ Γ) ≤ (n+1)^{|Σ|} · exp(−n · inf_{ν ∈ Γ} D(ν‖μ)).
The empirical measure cannot assume arbitrary values: it always lies in the finite set M_{1,n}(Σ) of probability measures on Σ whose weights are integer multiples of 1/n. The elements of M_{1,n}(Σ) are called types. The probability of a specific type ν ∈ M_{1,n}(Σ) can be estimated by

(n+1)^{−|Σ|} · exp(−n · D(ν‖μ)) ≤ P(L_n = ν) ≤ exp(−n · D(ν‖μ)).
Overall, the probability P(L_n ∈ Γ) is thus dominated by the type ν* ∈ Γ that has the smallest divergence from the "true" distribution μ. Every other type ν with D(ν‖μ) > D(ν*‖μ) has an exponentially smaller probability.
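The two-sided type bound can be verified exactly for a small binary example. The sketch below (alphabet {0, 1}, so |Σ| = 2, and a fair coin; the parameters are chosen for illustration) enumerates all types with denominator n and checks the sandwich (n+1)^{−|Σ|} e^{−nD(ν‖μ)} ≤ P(L_n = ν) ≤ e^{−nD(ν‖μ)}:

```python
import math

def kl_bernoulli(q, p):
    """D(Bern(q) || Bern(p)) in nats, with the convention 0 * log 0 = 0."""
    result = 0.0
    if q > 0:
        result += q * math.log(q / p)
    if q < 1:
        result += (1 - q) * math.log((1 - q) / (1 - p))
    return result

n, p = 50, 0.5
for k in range(n + 1):
    q = k / n                                              # the type Bern(k/n)
    exact = math.comb(n, k) * p ** k * (1 - p) ** (n - k)  # exact P(L_n = this type)
    d = kl_bernoulli(q, p)
    lower = (n + 1) ** (-2) * math.exp(-n * d)  # (n+1)^(-|Sigma|) with |Sigma| = 2
    upper = math.exp(-n * d)
    assert lower <= exact <= upper * (1 + 1e-9)  # tiny slack for float rounding
print("type bounds hold for all", n + 1, "types")
```

Note that the upper bound is attained (up to rounding) by the extreme types k = 0 and k = n, for which the type class contains a single sequence.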
General case
Sanov's theorem can be generalized considerably; in particular, the finiteness of the underlying set is unnecessary. Let Σ now be an arbitrary Polish space (e.g., ℝ^d), let M_1(Σ) be the set of all probability measures on Σ equipped with the weak topology, and let X_1, X_2, … again be a sequence of i.i.d. Σ-valued random variables with distribution μ ∈ M_1(Σ). Every measure ν that is absolutely continuous with respect to μ has a Radon–Nikodým density f = dν/dμ, and the divergence is then defined by D(ν‖μ) = ∫_Σ f log f dμ; for all other measures one sets D(ν‖μ) = ∞. Note that the empirical measures L_n need not be absolutely continuous with respect to μ.
Under these adapted assumptions, Sanov's theorem again states that for every set Γ ⊆ M_1(Σ),

−inf_{ν ∈ int Γ} D(ν‖μ) ≤ liminf_{n→∞} (1/n) log P(L_n ∈ Γ) ≤ limsup_{n→∞} (1/n) log P(L_n ∈ Γ) ≤ −inf_{ν ∈ cl Γ} D(ν‖μ).
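In the general case the divergence is computed from the Radon–Nikodým density. As a sketch (the two normal distributions are an assumed example; for ν = N(1, 1) and μ = N(0, 1) the closed form D(ν‖μ) = 1/2 nat is known), the integral ∫ f log f dμ = ∫ ν(x) log(ν(x)/μ(x)) dx can be approximated numerically:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# D(nu || mu) = integral over R of nu(x) * log(nu(x) / mu(x)) dx,
# approximated by a Riemann sum over [-10, 12]; the tails beyond are negligible.
dx = 1e-3
numeric = sum(
    normal_pdf(x, 1, 1) * math.log(normal_pdf(x, 1, 1) / normal_pdf(x, 0, 1)) * dx
    for x in (-10 + i * dx for i in range(int(22 / dx)))
)
print(round(numeric, 4))  # close to the closed-form value 0.5
```

The closed form used for the comparison follows from D(N(m, 1)‖N(0, 1)) = m²/2 in nats.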
References

Ivan N. Sanov: On the Probability of Large Deviations of Random Variables. In: Institute of Statistics Mimeo Series. No. 192, North Carolina State University, Department of Statistics, 1958 (English; translated by Dana E. A. Quade from the Russian original: О вероятности больших отклонений случайных величин, 1957).

Thomas M. Cover, Joy A. Thomas: Elements of Information Theory. 2nd edition. John Wiley & Sons, Hoboken, NJ 2006, p. 362 f.

Imre Csiszár, František Matúš: Information Projections Revisited. In: IEEE Transactions on Information Theory. Vol. 49, no. 6, 2003, pp. 1474–1490, doi:10.1109/TIT.2003.810633.