Bias-Variance Dilemma


The bias-variance dilemma describes the problem of minimizing two sources of error at the same time: the bias and the variance. It complicates the generalization from training data in supervised learning algorithms.

  • The bias is the error due to incorrect assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relationships between input and output (underfitting).
  • The variance is the error due to sensitivity to small fluctuations in the training data. High variance causes overfitting: the noise in the training data is modeled instead of the intended output.

The bias-variance decomposition makes it possible to analyze the expected error of a learning algorithm on a given problem as the sum of three terms: the bias, the variance, and an irreducible error resulting from the noise in the problem itself.

The bias-variance dilemma applies to all forms of supervised learning: classification, regression, and structured learning.

The dilemma arises in both statistics and machine learning. It has also been invoked to explain the effectiveness of heuristics in human learning.

Motivation

The bias-variance dilemma is a central problem in supervised learning. Ideally, one would choose a model that both accurately captures the regularities in its training data and generalizes to unseen test data. Unfortunately, it is usually impossible to do both at the same time. High-variance learning methods may represent their training data well, but risk overfitting if the training data is noisy or unrepresentative. In contrast, high-bias algorithms typically produce simpler models that do not tend to overfit, but may underfit their training data and fail to capture important regularities.

Models with high variance are usually more complex (e.g. higher-order polynomial regression), which allows them to represent the training data more accurately. They may, however, also represent a large part of the noise in the training data, so that their predictions become less accurate despite the additional complexity. In contrast, high-bias models tend to be relatively simple (e.g. low-order polynomial or linear regression), but may produce lower-variance predictions when applied to new data beyond the training set.

Bias-variance decomposition of the squared error

Given are training data consisting of a set of points $x_1, \dots, x_n$ with associated real values $y_i$. It is assumed that there is a functional but noisy relationship between $y$ and $x$, described by

$$y = f(x) + \varepsilon,$$

where the noise term $\varepsilon$ is normally distributed with zero mean and variance $\sigma^2$.

Now, with the help of a learning algorithm, a function $\hat f(x)$ is to be found that approximates the true function $f(x)$ as closely as possible. The accuracy of this approximation is measured by the mean squared error between $y$ and $\hat f(x)$: it is to be minimized both for $x_1, \dots, x_n$ and for points outside the sample. A perfect approximation is impossible because $y$ is influenced by the noise $\varepsilon$; an irreducible error must therefore be accepted for every function $\hat f$ that is found.

A function $\hat f$ that generalizes to points outside the training data can be determined with any of the many algorithms used for supervised learning. It turns out that, whichever function $\hat f$ is chosen, the expected error with respect to an unseen sample $x$ can be decomposed as follows:

$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \Big(\operatorname{Bias}\big[\hat f(x)\big]\Big)^2 + \operatorname{Var}\big[\hat f(x)\big] + \sigma^2.$$

Here

$$\operatorname{Bias}\big[\hat f(x)\big] = \mathbb{E}\big[\hat f(x)\big] - f(x)$$

describes the bias of the learning method and

$$\operatorname{Var}\big[\hat f(x)\big] = \mathbb{E}\big[\hat f(x)^2\big] - \mathbb{E}\big[\hat f(x)\big]^2$$

its variance. The expectation is taken over different training sets drawn from the same distribution. The three terms represent:

  • the squared bias of the learning method, which can be interpreted as the error caused by simplifying assumptions built into the method. For example, approximating a nonlinear function $f(x)$ with a learning method for linear models produces errors in the estimates $\hat f(x)$ due to this assumption.
  • the variance of the learning method, which intuitively describes how much the learning method $\hat f(x)$ fluctuates around its expected value.
  • the irreducible error $\sigma^2$. Since all three terms are non-negative, this error is a lower bound on the expected error on unseen test data.

The more complex the model function $\hat f(x)$, the more data points it will capture correctly and the lower its bias will be. However, higher complexity makes the model fluctuate more in order to capture the data points, thus increasing its variance.
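
The decomposition can be checked empirically by repeatedly drawing training sets, fitting a model on each, and measuring how the predictions at a fixed test point scatter around the truth. Below is a minimal Python sketch of this procedure; the true function $\sin(2\pi x)$, the noise level, and the polynomial degrees are illustrative choices, not part of any standard:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # assumed "true" function for the simulation
    return np.sin(2 * np.pi * x)

sigma = 0.3                     # noise standard deviation
n_train, n_runs = 30, 500       # training-set size, number of repetitions
x0 = 0.35                       # fixed test point for the decomposition

for degree in (1, 3, 9):        # model complexity: polynomial degree
    preds = np.empty(n_runs)
    for r in range(n_runs):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - f(x0)) ** 2    # squared bias of the estimator
    var = preds.var()                      # variance of the estimator
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"bias^2+var+sigma^2={bias2 + var + sigma**2:.4f}")
```

As the degree grows, the squared-bias term shrinks while the variance term grows, and the sum of the three terms tracks the expected squared error on unseen points.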

Derivation

The derivation of the bias-variance decomposition for squared errors proceeds as follows. For readability, we abbreviate $f = f(x)$ and $\hat f = \hat f(x)$. By the shift theorem, for every random variable $X$,

$$\operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, \qquad \text{i.e.} \qquad \mathbb{E}[X^2] = \operatorname{Var}[X] + \mathbb{E}[X]^2.$$

Since $f$ is deterministic, $\mathbb{E}[f] = f$. Together with the assumptions $y = f + \varepsilon$ and $\mathbb{E}[\varepsilon] = 0$, this implies

$$\mathbb{E}[y] = \mathbb{E}[f + \varepsilon] = \mathbb{E}[f] = f.$$

Since $\operatorname{Var}[\varepsilon] = \sigma^2$, it follows that

$$\operatorname{Var}[y] = \mathbb{E}\big[(y - \mathbb{E}[y])^2\big] = \mathbb{E}\big[(f + \varepsilon - f)^2\big] = \mathbb{E}[\varepsilon^2] = \sigma^2.$$

Hence, since $\varepsilon$ and $\hat f$ are independent,

$$\begin{aligned}
\mathbb{E}\big[(y - \hat f)^2\big]
&= \mathbb{E}[y^2] - 2\,\mathbb{E}[y \hat f] + \mathbb{E}[\hat f^2] \\
&= \operatorname{Var}[y] + \mathbb{E}[y]^2 - 2\,\mathbb{E}[y]\,\mathbb{E}[\hat f] + \operatorname{Var}[\hat f] + \mathbb{E}[\hat f]^2 \\
&= \operatorname{Var}[y] + \big(f - \mathbb{E}[\hat f]\big)^2 + \operatorname{Var}[\hat f] \\
&= \sigma^2 + \operatorname{Bias}[\hat f]^2 + \operatorname{Var}[\hat f].
\end{aligned}$$

Application to classification

The bias-variance decomposition was originally formulated for least-squares regression. In the case of classification, a similar decomposition can be found. Alternatively, if the classification problem can be formulated as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.
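
For the probabilistic case, the same Monte Carlo recipe as above applies: estimate a probability $\hat p(x)$ from many independently drawn training sets and compare its mean and spread with the true probability $p(x)$. A minimal sketch, assuming a k-nearest-neighbor probability estimate and an illustrative logistic true probability (both choices are for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_true(x):                  # assumed true class-1 probability
    return 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))

def knn_prob(x_tr, y_tr, x0, k=15):
    """Estimated probability: fraction of positives among the k nearest points."""
    nearest = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[nearest].mean()

n, runs, x0 = 100, 1000, 0.3
preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, n)
    y = rng.binomial(1, p_true(x))          # noisy 0/1 labels
    preds[r] = knn_prob(x, y, x0)

bias2 = (preds.mean() - p_true(x0)) ** 2    # squared bias of the estimate
print(f"bias^2={bias2:.4f}  variance={preds.var():.4f}")
```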

Approaches to resolving the dilemma

Dimensionality reduction and feature selection can reduce the variance by simplifying the model. Likewise, a larger training set tends to reduce the variance. Adding features (predictors) reduces the bias at the expense of increased variance. Learning algorithms typically have parameters that control bias and variance:

  • (Generalized) linear models can be regularized to decrease their variance at the cost of increased bias (a minimal ridge-regression sketch follows this list).
  • In artificial neural networks, the variance increases and the bias decreases with the number of hidden units. As in GLMs, regularization is typically applied.
  • In k-nearest-neighbor models, a high value of k leads to low variance and high bias (see below).
  • In memory-based (instance-based) learning, regularization can be achieved by combining prototype methods with nearest-neighbor methods.
  • In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control the variance.
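
As a concrete illustration of the first bullet point, the following sketch regularizes a linear model with ridge regression and measures how the bias and variance of its prediction at a fixed input shift as the regularization strength grows. The true weights, noise level, and penalty values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

true_w = np.array([2.0, -1.0, 0.5])         # assumed true coefficients
sigma, n, runs = 1.0, 20, 1000
x0 = np.array([1.0, 0.5, -0.3])             # fixed test input

for lam in (0.0, 1.0, 10.0):                # regularization strength
    preds = np.empty(runs)
    for r in range(runs):
        X = rng.normal(size=(n, 3))
        y = X @ true_w + rng.normal(0, sigma, n)
        preds[r] = x0 @ ridge_fit(X, y, lam)
    bias2 = (preds.mean() - x0 @ true_w) ** 2
    print(f"lambda={lam:5.1f}: bias^2={bias2:.4f}  variance={preds.var():.4f}")
```

Increasing the penalty shrinks the coefficients toward zero, which raises the bias but lowers the variance of the predictions.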

One way to address the dilemma is to use mixture models and ensembles of different learning algorithms. For example, boosting combines many "weak" models (with high bias) into a new model that has lower bias than the individual models. Bagging, on the other hand, combines "strong" learners (with high variance) in a way that reduces the variance of the newly constructed classifier.
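
The variance-reducing effect of bagging can be demonstrated with the same simulation setup as above: fit one high-variance model per training set, and compare the spread of its predictions with that of an average over bootstrap-resampled fits. A minimal sketch under the same illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                        # assumed true function
    return np.sin(2 * np.pi * x)

def bagged_predict(x_tr, y_tr, x0, n_models=50, degree=9):
    """Average high-variance polynomial fits over bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x_tr), len(x_tr))     # bootstrap sample
        preds.append(np.polyval(np.polyfit(x_tr[idx], y_tr[idx], degree), x0))
    return np.mean(preds)

sigma, n, runs, x0 = 0.3, 40, 200, 0.35
single, bagged = np.empty(runs), np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    single[r] = np.polyval(np.polyfit(x, y, 9), x0)     # one unstable fit
    bagged[r] = bagged_predict(x, y, x0)                # bagged ensemble

print(f"variance: single={single.var():.4f}  bagged={bagged.var():.4f}")
```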

k-nearest neighbors

In the case of the k-nearest-neighbor algorithm, there is a closed formula that relates the bias-variance decomposition to the parameter $k$:

$$\mathbb{E}\big[(y - \hat f(x))^2 \mid X = x\big] = \left(f(x) - \frac{1}{k}\sum_{i=1}^{k} f\big(N_i(x)\big)\right)^{2} + \frac{\sigma^2}{k} + \sigma^2,$$

where $N_1(x), \dots, N_k(x)$ are the $k$ nearest neighbors of $x$ within the training data. The bias (first term) is a monotonically increasing function of $k$, while the variance (second term) falls as $k$ is increased. In fact, under "reasonable assumptions", the bias of the 1-nearest-neighbor estimator vanishes entirely as the size of the training set approaches infinity.
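
The behavior of the two terms can be observed directly by simulation: as $k$ grows, the variance of the k-nearest-neighbor estimate at a fixed point falls roughly like $\sigma^2/k$ while the bias grows. A minimal sketch with illustrative choices of true function and noise level:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):                        # assumed true function
    return np.sin(2 * np.pi * x)

def knn_predict(x_tr, y_tr, x0, k):
    """Average the targets of the k nearest training points."""
    nearest = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[nearest].mean()

sigma, n, runs, x0 = 0.3, 50, 1000, 0.35
for k in (1, 5, 25):
    preds = np.empty(runs)
    for r in range(runs):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = knn_predict(x, y, x0, k)
    bias2 = (preds.mean() - f(x0)) ** 2
    print(f"k={k:2d}: bias^2={bias2:.4f}  variance={preds.var():.4f}  "
          f"sigma^2/k={sigma**2 / k:.4f}")
```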

Application to human learning

While the bias-variance dilemma is mostly discussed in machine learning, it has also been examined in the context of human cognition, most notably by Gerd Gigerenzer and colleagues in the context of learned heuristics. They argue that, in the case of sparse and poorly characterized training data delivered by experience, the human brain resolves the dilemma by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalizability to new situations and presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.

According to Geman et al., the bias-variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of pre-existing "wiring" that is later tuned by experience. This is because model-free approaches that avoid high variance require impractically large training sets.

References

  1. Stuart Geman, E. Bienenstock, R. Doursat: Neural networks and the bias/variance dilemma. In: Neural Computation. 4, 1992, pp. 1-58. doi:10.1162/neco.1992.4.1.1.
  2. Bias-Variance Decomposition. In: Claude Sammut, Geoffrey I. Webb (eds.): Encyclopedia of Machine Learning. Springer, 2011, pp. 100-101.
  3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: An Introduction to Statistical Learning. Springer, 2013 (online).
  4. Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning. Springer, 2009 (online; archived in the Internet Archive, January 26, 2015).
  5. Sethu Vijayakumar: The Bias-Variance Tradeoff. University of Edinburgh, 2007. Retrieved August 19, 2014.
  6. Greg Shakhnarovich: Notes on derivation of bias-variance decomposition in linear regression. 2011. Archived from the original on August 21, 2014; retrieved August 20, 2014.
  7. Pedro Domingos: A unified bias-variance decomposition. In: ICML.
  8. Giorgio Valentini, Thomas G. Dietterich: Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. In: JMLR. 5, 2004, pp. 725-775.
  9. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008, pp. 308-314 (online).
  10. F. Gagliardi: Instance-based classifiers applied to medical databases: diagnosis and knowledge extraction. In: Artificial Intelligence in Medicine. Volume 52, No. 3, 2011, pp. 123-139. doi:10.1016/j.artmed.2011.04.002.
  11. Jo-Anne Ting, Sethu Vijayakumar, Stefan Schaal: Locally Weighted Regression for Control. In: Claude Sammut, Geoffrey I. Webb (eds.): Encyclopedia of Machine Learning. Springer, 2011, p. 615.
  12. Scott Fortmann-Roe: Understanding the Bias-Variance Tradeoff. 2012.
  13. Gerd Gigerenzer, Henry Brighton: Homo heuristicus: Why biased minds make better inferences. In: Topics in Cognitive Science. Volume 1, No. 1, 2009, pp. 107-143. doi:10.1111/j.1756-8765.2008.01006.x. PMID 25164802.
