Bernstein–von Mises theorem

The Bernstein–von Mises theorem is a theorem of mathematical statistics named after Sergei Bernstein and Richard von Mises. Its intuitive significance is that Bayesian learning, as practiced, for example, in neural networks, leads to the correct results in the long run.

The theorem states that in parametric models the a posteriori distribution asymptotically (i.e. for a large number of observations) concentrates around the true parameter, regardless of the a priori distribution (consistency of the Bayesian estimator). It thus establishes an important connection between Bayesian statistics and frequentist statistics.

According to the Bernstein–von Mises theorem, the suitably centered and scaled a posteriori distribution is, moreover, asymptotically a normal distribution with the inverse Fisher information matrix as covariance matrix (asymptotic efficiency of the Bayesian estimator). Accordingly, optimal frequentist and Bayesian approaches lead asymptotically to qualitatively equal results in parametric models.
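
In informal notation (a heuristic paraphrase, not part of the formal statement given below, with $\hat\theta_n$ an efficient estimator such as the maximum likelihood estimator and $I(\theta_0)$ the Fisher information at the true parameter $\theta_0$), this approximation can be written as

$$
\Pi\bigl(\theta \in \cdot \mid X_1, \dots, X_n\bigr) \;\approx\; \mathcal{N}\!\Bigl(\hat\theta_n,\; \tfrac{1}{n}\, I(\theta_0)^{-1}\Bigr).
$$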

Thus, the a posteriori distribution of the unknown quantities in a problem is, in a certain sense, independent of the a priori distribution as soon as the amount of information provided by the sample is large enough.

Application example

In the following, the application of the theorem and the typical procedure of Bayesian inference are illustrated by a simple example: a random variable $X$ is observed, and its realization $x$ is given by a set of measurement data from the sample space. These data are to be described by a stochastic model $f(x \mid \theta)$ with an unknown parameter $\theta$, which may also be vector-valued. Before the data are collected, both their values and the value of the parameter are uncertain, and a joint stochastic model for $(X, \theta)$ is useful. In this interpretation, the parameter $\theta$ is itself a random variable with a prior distribution $\pi(\theta)$. This is obviously still unknown before the actual measurement, and a "reasonable" a priori assumption has to be made about it. After the data have been observed, the opinion about the parameter is updated. All available information about $\theta$ is described by the posterior distribution $\pi(\theta \mid x)$. According to Bayes' theorem, this is given as

$$
\pi(\theta \mid x) \;=\; \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta')\, \pi(\theta')\, d\theta'}\,,
$$

where $f(x \mid \theta)$ represents the so-called likelihood function and describes the distribution of $X$ for a given parameter $\theta$. The hope is that the posterior distribution $\pi(\theta \mid x)$ allows a better and more precise statement about $\theta$ than the original naive prior distribution $\pi(\theta)$. This last step is commonly referred to as Bayesian learning and is an essential step in learning in neural networks. If we now take this posterior distribution as the new prior distribution, collect a new data set and repeat the above procedure, we obtain another, updated posterior distribution after a further Bayesian learning step. This now contains information from two data sets and should therefore allow an even better and more precise statement about $\theta$. That the repeated application of this Bayesian learning successfully approximates the actual distribution of $\theta$ is the content of the Bernstein–von Mises theorem. Under certain conditions, the convergence of this method to the actual distribution of $\theta$ is almost sure and independent of the prior distribution.
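
The following sketch (not from the original article; the Bernoulli model with conjugate Beta priors is chosen purely for illustration, and all variable names are hypothetical) demonstrates this repeated Bayesian learning numerically: two rather different priors are updated with successive batches of data, and both posteriors concentrate around the same true parameter value.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.7  # "true" success probability, unknown to the observer (illustrative choice)

# Two deliberately different Beta priors over theta
priors = {
    "flat prior Beta(1, 1)": (1.0, 1.0),
    "sceptical prior Beta(10, 40)": (10.0, 40.0),
}

params = dict(priors)
for step in range(1, 6):
    # Collect a new batch of Bernoulli(theta0) observations
    x = rng.binomial(1, theta0, size=200)
    s, n = int(x.sum()), x.size
    for name, (a, b) in params.items():
        # Conjugate update: the current posterior becomes the next prior
        params[name] = (a + s, b + n - s)
    summary = ", ".join(
        f"{name}: posterior mean {a / (a + b):.3f}" for name, (a, b) in params.items()
    )
    print(f"after batch {step}: {summary}")
```

Regardless of the starting prior, both posterior means approach theta0 = 0.7 and the posteriors become increasingly concentrated, which is exactly the behaviour described above.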

Formulation of the theorem

For a parameter set $\Theta \subseteq \mathbb{R}^k$, let $(P_\theta)_{\theta \in \Theta}$ be a dominated parametric statistical model, i.e. each $P_\theta$ has a density $p_\theta$ with respect to a common measure $\mu$. Let $\theta_0 \in \Theta$ be the parameter value that one actually wants to estimate.

We assume that the model is differentiable in quadratic mean at $\theta_0$, i.e. that there exists a vector of measurable functions $\dot\ell_{\theta_0}$ (the score at $\theta_0$) such that, for $h \to 0$:

$$
\int \Bigl[ \sqrt{p_{\theta_0 + h}} - \sqrt{p_{\theta_0}} - \tfrac{1}{2}\, h^{\top} \dot\ell_{\theta_0}\, \sqrt{p_{\theta_0}} \Bigr]^2 \, d\mu \;=\; o\bigl(\lVert h \rVert^2\bigr).
$$

The score is centered, $\mathbb{E}_{\theta_0}\bigl[\dot\ell_{\theta_0}\bigr] = 0$, and has covariance matrix $I_{\theta_0} = \mathbb{E}_{\theta_0}\bigl[\dot\ell_{\theta_0} \dot\ell_{\theta_0}^{\top}\bigr]$, the Fisher information. We assume that this matrix is invertible.
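
As a concrete illustration (this example is not part of the original article): in the Bernoulli model with $P_\theta(\{1\}) = \theta$ and $P_\theta(\{0\}) = 1 - \theta$, the score is $\dot\ell_\theta(x) = \frac{x - \theta}{\theta(1 - \theta)}$, which is centered under $P_\theta$, and the Fisher information is

$$
I_\theta \;=\; \mathbb{E}_\theta\bigl[\dot\ell_\theta(X)^2\bigr] \;=\; \frac{1}{\theta(1 - \theta)},
$$

which is invertible (positive) for every $\theta \in (0, 1)$.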

In order to be able to apply Bayes' theorem, we assume that the a priori distribution has a density $\pi$, which we assume to be continuous and positive at $\theta_0$.

Finally, we assume that for every $\varepsilon > 0$ there exists a sequence of statistical tests $\varphi_n$ such that $\mathbb{E}_{\theta_0}[\varphi_n] \to 0$ and $\sup_{\lVert \theta - \theta_0 \rVert \ge \varepsilon} \mathbb{E}_{\theta}[1 - \varphi_n] \to 0$.

Under these assumptions, the theorem states that the a posteriori distribution computed from the observations via Bayes' theorem is "asymptotically close" in probability to a normal distribution whose covariance matrix is the inverse of the Fisher information.

Mathematically, this is expressed in terms of the total variation distance by the relationship

$$
\Bigl\lVert \Pi\bigl(\sqrt{n}\,(\theta - \theta_0) \in \cdot \;\bigm|\; X_1, \dots, X_n\bigr) - \mathcal{N}\bigl(\Delta_{n,\theta_0},\, I_{\theta_0}^{-1}\bigr) \Bigr\rVert_{\mathrm{TV}} \;\longrightarrow\; 0 \quad \text{in } P_{\theta_0}\text{-probability},
$$

with $\displaystyle \Delta_{n,\theta_0} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I_{\theta_0}^{-1}\, \dot\ell_{\theta_0}(X_i)$.
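
The following sketch (again not from the article; the Bernoulli–Beta model and all names are illustrative assumptions) checks this statement numerically: for growing sample sizes it compares the exact posterior distribution of $\sqrt{n}\,(\theta - \theta_0)$ with the normal distribution $\mathcal{N}(\Delta_{n,\theta_0}, I_{\theta_0}^{-1})$ and estimates the total variation distance on a grid.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta0 = 0.3          # true parameter (illustrative choice)
a, b = 2.0, 5.0       # arbitrary Beta prior hyperparameters

for n in (10, 100, 1000, 10000):
    x = rng.binomial(1, theta0, size=n)
    s = int(x.sum())

    # Exact posterior in the conjugate Bernoulli-Beta model: Beta(a + s, b + n - s)
    posterior = stats.beta(a + s, b + n - s)

    # Bernstein-von Mises normal: mean Delta_n, variance I(theta0)^{-1} = theta0 * (1 - theta0)
    delta_n = np.sqrt(n) * (x.mean() - theta0)
    inv_fisher = theta0 * (1 - theta0)
    bvm = stats.norm(delta_n, np.sqrt(inv_fisher))

    # Density of h = sqrt(n) * (theta - theta0) under the posterior (change of variables)
    h = np.linspace(-6.0, 6.0, 4001)
    q = posterior.pdf(theta0 + h / np.sqrt(n)) / np.sqrt(n)
    p = bvm.pdf(h)

    # Total variation distance, approximated by a Riemann sum on the grid
    tv = 0.5 * np.sum(np.abs(q - p)) * (h[1] - h[0])
    print(f"n = {n:6d}   estimated TV distance = {tv:.4f}")
```

The printed distances shrink towards zero as $n$ grows, in line with the convergence in total variation asserted by the theorem.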

Caveat

The Bernstein–von Mises theorem is entirely satisfactory if one assumes that the parameter is chosen by nature through a random mechanism whose probability law is known. In some cases, however, it is necessary to identify the exceptional null set. For example, if the parameter is fixed but unknown and the prior is merely used as a convenient device for computing estimates, it is important to know for which null set the method fails. In particular, it would be desirable to choose the prior so that the exceptional null set is actually empty. The simplest case of this problem, namely that of independent, identically distributed, discrete observations, is discussed in the literature cited below.

Bayesian estimates can be inconsistent when the underlying mechanism allows an infinite number of possible outcomes. There are, however, classes of priors ("tailfree priors" and "Dirichlet priors") for which consistency can be proven, whereas for other priors, for example those discussed in the references below, inconsistent estimates are obtained.

History

The theorem is named after Richard von Mises and Sergei Natanowitsch Bernstein, although the first rigorous proof was given by Joseph L. Doob in 1949 for random variables with finite probability spaces. Later, Lucien Le Cam, his doctoral student Lorraine Schwartz, and the mathematicians David A. Freedman and Persi Diaconis generalized the theorem and its assumptions. A remarkable result by David A. Freedman from 1965 should be pointed out: the Bernstein–von Mises theorem almost surely fails to apply if the random variable lives in a countably infinite probability space. In other words, in this case the method does not converge to the true distribution for almost all initial prior distributions. The intuitive reason is that the information learned in each individual Bayesian learning step has measure 0. Negative consequences of this kind already appear for high-dimensional but finite problems, as Persi Diaconis and David A. Freedman note in the last sentence of the summary of their 1986 publication:

“Unfortunately, in high-dimensional problems, arbitrary details of the prior can really matter; indeed, the prior can swamp the data, no matter how much data you have. That is what our examples suggest, and that is why we advise against the mechanical use of Bayesian nonparametric techniques.”

The well-known statistician A. W. F. Edwards once remarked similarly: “Sometimes, in defense of the Bayesian concept, it is said that the choice of the prior distribution is irrelevant in practice, because it hardly affects the posterior distribution when there is enough data. The less that is said about this ‘defense’, the better.”

Literature

  • David A. Freedman: On the asymptotic behavior of Bayes' estimates in the discrete case. In: The Annals of Mathematical Statistics, vol. 34, 1963, pp. 1386–1403, doi:10.1214/aoms/1177703871, JSTOR 2238346.
  • David A. Freedman: On the asymptotic behavior of Bayes estimates in the discrete case II. In: The Annals of Mathematical Statistics, vol. 36, 1965, pp. 454–456, doi:10.1214/aoms/1177700155, JSTOR 2238150.
  • Lucien Le Cam: Asymptotic Methods in Statistical Decision Theory. Springer, 1986, ISBN 0-387-96307-3, pp. 336 and 618–621.
  • Lorraine Schwartz: On Bayes procedures. In: Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, vol. 4, 1965, pp. 10–26.

References

  1. A. W. van der Vaart: Asymptotic Statistics. Cambridge University Press, 1998, ISBN 0-521-78450-6, Section 10.2 (Bernstein–von Mises theorem).
  2. Freedman, op. cit.
  3. Diaconis, Freedman, op. cit.
  4. Joseph L. Doob: Applications of the theory of martingales. In: Colloq. Intern. du CNRS (Paris), 13, 1949, pp. 22–28.
  5. Persi Diaconis, David A. Freedman: On the consistency of Bayes estimates. In: The Annals of Statistics, 14, 1986, pp. 1–26, JSTOR 2241255.
  6. A. W. F. Edwards: Likelihood. Johns Hopkins University Press, Baltimore 1992, ISBN 0-8018-4443-6.