Information criterion


In statistics, an information criterion is a criterion for model selection. It follows the idea of Occam's razor that a model should not be unnecessarily complex, and it balances the goodness of fit of the estimated model to the available empirical data (the sample) against the model's complexity, measured by the number of parameters. The number of parameters enters as a "penalty", since otherwise complex models with many parameters would be preferred. In this sense, the adjusted coefficient of determination, which goes back to Henri Theil (1970), is a forerunner of the information criteria known today.

All information criteria in use today have in common that they come in two different formulations. The measure of goodness of fit is formulated either as the maximum likelihood or as the minimum variance of the residuals. This leads to different interpretations. In the first case, the "best" model is the one for which the information criterion takes the highest value (the penalty for the number of parameters is subtracted). In the second case, the model with the lowest value of the information criterion is best (the penalty for the number of parameters is added).

Akaike information criterion

The historically oldest criterion was proposed in 1973 by Hirotsugu Akaike (1927–2009) as "an information criterion" and is known today as the Akaike information criterion (AIC). It is one of the most frequently used criteria for model selection in the context of likelihood-based inference.

Assume that the variable of interest has a distribution in the population with an unknown density function $g$. In maximum likelihood estimation (ML estimation), a known distributional family with an unknown parameter $\theta$ is assumed; that is, one assumes that the density function can be written as $f(x \mid \theta)$. The Kullback–Leibler divergence $D\bigl(g \,\|\, f(\cdot \mid \hat\theta)\bigr)$ is used as a measure of the distance between $g$ and $f(\cdot \mid \hat\theta)$, where $\hat\theta$ denotes the parameter estimated by maximum likelihood. The better the ML model, the smaller the Kullback–Leibler divergence.
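
In its standard form, the Kullback–Leibler divergence between $g$ and $f(\cdot \mid \hat\theta)$ reads

\[
D\bigl(g \,\|\, f(\cdot \mid \hat\theta)\bigr)
  = \int g(x)\,\ln\frac{g(x)}{f(x \mid \hat\theta)}\,\mathrm{d}x
  = \int g(x)\,\ln g(x)\,\mathrm{d}x - \int g(x)\,\ln f(x \mid \hat\theta)\,\mathrm{d}x .
\]

Since the first term does not depend on the model, minimizing the divergence amounts to maximizing the expected log-likelihood $\int g(x)\,\ln f(x \mid \hat\theta)\,\mathrm{d}x$; the AIC is based on an (approximately) bias-corrected, sample-based estimate of this quantity, which motivates the penalty term below.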

For the case of a regular linear model, Akaike was able to show that the negative log-likelihood function is a biased estimator of the Kullback–Leibler divergence and that the bias converges asymptotically (as the sample size tends to infinity) to the number of parameters to be estimated. For a maximum likelihood model with a $p$-dimensional parameter vector $\theta$, the Akaike information criterion is defined as

$\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p,$

where $\ell(\hat\theta)$ denotes the maximized log-likelihood function. The criterion is negatively oriented, i.e. when choosing among candidate models for the data (model selection), the preferred model is the one with the minimum AIC value. The AIC rewards goodness of fit (as judged by the likelihood function), but it also contains a penalty term that punishes high model complexity; the penalty is an increasing function of the number of estimated parameters $p$. The penalty term prevents overfitting, because increasing the number of parameters in the model almost always improves the fit. Instead of the AIC as defined above, a variant that also takes the sample size $n$ into account is sometimes used.
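
As a minimal illustrative sketch (the model names, maximized log-likelihood values and parameter counts below are hypothetical, not taken from the article), the AIC-based selection rule can be expressed in Python as follows:

# Minimal sketch: choosing among candidate models by the Akaike information criterion.
# The maximized log-likelihood values and parameter counts below are hypothetical.

def aic(max_log_likelihood: float, num_params: int) -> float:
    """AIC = -2 * maximized log-likelihood + 2 * number of estimated parameters."""
    return -2.0 * max_log_likelihood + 2.0 * num_params

# Hypothetical candidates: (name, maximized log-likelihood, number of parameters p)
candidates = [
    ("model_1", -125.0, 3),
    ("model_2", -119.0, 5),
    ("model_3", -118.7, 8),
]

scores = {name: aic(ll, p) for name, ll, p in candidates}
best = min(scores, key=scores.get)
for name, value in sorted(scores.items()):
    print(f"{name}: AIC = {value:.1f}")
print(f"Preferred model (minimum AIC): {best}")

In this hypothetical example model_2 is preferred: its additional parameters improve the likelihood enough to offset the penalty, whereas those of model_3 do not.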

General definition

Assume that there are $n$ independent observations $y_1, \ldots, y_n$ with mean $\operatorname{E}(y_i) = \mu_i$ and variance $\sigma^2$. The variables $x_1, \ldots, x_k$ are available as potential regressors. Let the specified model $M$ be defined by the subset of included explanatory variables, with associated design matrix $X_M$. For the least squares estimator one obtains $\hat{\boldsymbol\beta}_M = (X_M^\top X_M)^{-1} X_M^\top \mathbf{y}$.
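
A minimal numerical sketch of this least squares step (the simulated data, sample size and coefficient values are hypothetical) could look as follows in Python:

import numpy as np

# Sketch of the least squares estimator for a candidate model M with design matrix X_M.
# The data are simulated; the true coefficients (1.0, 2.0) are hypothetical.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
X_M = np.column_stack([np.ones(n), x1])                  # intercept plus one regressor
y = X_M @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# beta_hat_M = (X_M' X_M)^{-1} X_M' y
beta_hat_M = np.linalg.solve(X_M.T @ X_M, X_M.T @ y)
print(beta_hat_M)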

In general, the Akaike information criterion is defined by

$\mathrm{AIC} = -2\,\ell(\hat{\boldsymbol\beta}_M, \hat\sigma^2) + 2\,(|M| + 1),$

where $\ell(\hat{\boldsymbol\beta}_M, \hat\sigma^2)$ is the maximum value of the log-likelihood function, i.e. the log-likelihood function evaluated at the ML estimators $\hat{\boldsymbol\beta}_M$ and $\hat\sigma^2$. Smaller AIC values go hand in hand with better model fit. The number of parameters here is $|M| + 1$, because the variance of the disturbance term is also counted as a parameter. In a linear model with normally distributed disturbance terms (the classical linear model of normal regression), one obtains for the negative log-likelihood function (for the derivation of the log-likelihood function, see maximum likelihood estimation)

$-\ell(\hat{\boldsymbol\beta}_M, \hat\sigma^2) = \tfrac{n}{2}\,\ln(2\pi) + \tfrac{n}{2}\,\ln(\hat\sigma^2) + \tfrac{n}{2}$

and thus

$\mathrm{AIC} = n\,\ln(\hat\sigma^2) + 2\,(|M| + 1),$

where additive terms that do not depend on the model have been dropped.

Here $n$ is the sample size and $\hat\sigma^2$ the variance of the disturbance terms. The variance of the disturbance terms is estimated from the regression model using the residual sum of squares, $\hat\sigma^2 = \tfrac{1}{n}\sum_{i=1}^{n} \hat\varepsilon_i^2$ (see unbiased estimation of the variance of the disturbance terms). Note, however, that this is the biased (and not, as usual, the unbiased) variant of the estimator of the disturbance variance.
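
The following Python sketch (with simulated, hypothetical data) illustrates the linear-model form of the AIC: the disturbance variance is estimated by the biased estimator, i.e. the residual sum of squares divided by $n$ rather than by $n - |M|$, and the criterion $n\,\ln(\hat\sigma^2) + 2(|M|+1)$ is compared for a smaller and a larger candidate model.

import numpy as np

def linear_model_aic(X: np.ndarray, y: np.ndarray) -> float:
    """AIC = n * ln(sigma2_hat) + 2 * (|M| + 1), with the biased variance estimator RSS / n."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / n          # biased: divide by n, not by n - |M|
    num_params = X.shape[1] + 1                     # |M| regression coefficients + the variance
    return n * np.log(sigma2_hat) + 2 * num_params

# Simulated, hypothetical data: y depends linearly on x only.
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x])          # intercept + x
X_large = np.column_stack([np.ones(n), x, x**2])    # adds an unnecessary quadratic term
print("AIC small model:", linear_model_aic(X_small, y))
print("AIC large model:", linear_model_aic(X_large, y))

The superfluous quadratic term reduces the residual variance only marginally, so the larger model typically receives the worse (higher) AIC value here.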

Bayesian information criterion

A disadvantage of the Akaike information criterion is that its penalty term is independent of the sample size. With large samples, improvements in the log-likelihood or in the residual variance are "easier" to achieve, which is why the criterion tends to favor models with relatively many parameters for large samples. For large samples, the use of the Bayesian information criterion proposed by Gideon Schwarz in 1978, also called the Schwarz–Bayes criterion or Schwarz Bayesian information criterion (SBC for short; English: Bayesian information criterion, BIC), is therefore recommended. For a model with a $p$-dimensional parameter vector $\theta$, log-likelihood function $\ell(\theta)$ and maximum likelihood estimator $\hat\theta$, the BIC is defined as

$\mathrm{BIC} = -2\,\ell(\hat\theta) + \ln(n)\,p$

or

$\mathrm{BIC} = n\,\ln(\hat\sigma^2) + \ln(n)\,(|M| + 1).$

With this criterion, the factor in the penalty term grows logarithmically with the number of observations $n$. From as few as eight observations ($\ln(8) \approx 2.08 > 2$), the BIC penalizes additional parameters more severely than the AIC. Formally, the BIC is identical to the AIC, except that the factor 2 in front of the number of parameters is replaced by $\ln(n)$.

It has the same orientation as AIC, so models with a smaller BIC are preferred.
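
A short Python sketch (with hypothetical values) illustrates the BIC and the comparison of the two penalty factors: for $n \geq 8$ the BIC factor $\ln(n)$ exceeds the constant factor 2 of the AIC.

import math

def bic(max_log_likelihood: float, num_params: int, n: int) -> float:
    """BIC = -2 * maximized log-likelihood + ln(n) * number of parameters."""
    return -2.0 * max_log_likelihood + math.log(n) * num_params

# Penalty per additional parameter: constant 2 for the AIC, ln(n) for the BIC.
for n in (5, 7, 8, 20, 100, 1000):
    print(f"n = {n:4d}: AIC penalty = 2.00, BIC penalty = {math.log(n):.2f}")

# Hypothetical example: maximized log-likelihood -119.0, 5 parameters, n = 100 observations.
print("BIC =", bic(-119.0, 5, 100))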

The latter criterion is often used, especially in sociology. Kuha (2004) points out the different goals of the two criteria: while the BIC tries to select the model that has the highest posterior probability of being the true model, the AIC assumes that there is no true model. Half of the negative BIC is also known as the Schwarz criterion.

Further information criteria

There are also other, less frequently used information criteria, such as the Hannan–Quinn information criterion or the deviance information criterion (DIC).

A statistical test based on information criteria is the Vuong test .

Literature

  • Hirotsugu Akaike: Information theory and an extension of the maximum likelihood principle. In: B. N. Petrov et al. (eds.): Proceedings of the Second International Symposium on Information Theory. Akademiai Kiado, Budapest 1973, pp. 267–281.
  • Kenneth P. Burnham, David R. Anderson: Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York 2002, ISBN 0-387-95364-7.
  • Kenneth P. Burnham, David R. Anderson: Multimodel Inference: Understanding AIC and BIC in Model Selection. In: Sociological Methods and Research. Volume 33, 2004, doi:10.1177/0049124104268644, pp. 261–304.
  • Jouni Kuha: AIC and BIC: Comparisons of Assumptions and Performance. In: Sociological Methods and Research. Volume 33, 2004, doi:10.1177/0049124103262065, pp. 188–229.
  • Gideon Schwarz: Estimating the Dimension of a Model. In: Annals of Statistics. Volume 6, No. 2, 1978, doi:10.1214/aos/1176344136, JSTOR 2958889, pp. 461–464.
  • David L. Weakliem: Introduction to the Special Issue on Model Selection. In: Sociological Methods and Research. Volume 33, 2004, doi:10.1177/0049124104268642, pp. 167–187.

Individual evidence

  1. ^ Akaike's information criterion. Glossary of statistical terms. In: International Statistical Institute. June 1, 2011, accessed July 4, 2020.
  2. ^ Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: models, methods and applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 664.
  3. ^ Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: models, methods and applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 664.
  4. ^ Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: models, methods and applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 144.
  5. ^ Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: models, methods and applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 148.
  6. ^ Bayes information criterion. Glossary of statistical terms. In: International Statistical Institute. June 1, 2011, accessed July 4, 2020.
  7. ^ Leonhard Held, Daniel Sabanés Bové: Applied Statistical Inference: Likelihood and Bayes. Springer, Heidelberg/New York/Dordrecht/London 2014, ISBN 978-3-642-37886-7, p. 230.
  8. ^ Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: models, methods and applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 677.
  9. ^ Leonhard Held, Daniel Sabanés Bové: Applied Statistical Inference: Likelihood and Bayes. Springer, Heidelberg/New York/Dordrecht/London 2014, ISBN 978-3-642-37886-7, p. 230.
  10. Bastian Popp: Brand Success Through Brand Communities: An Analysis of the Effect of Psychological Variables on Economic Success Indicators.
