Overfitting

[Figure: blue: error on the training data; red: error on the test data. If the error on the test data rises while the error on the training data continues to fall, this may indicate overfitting.]

Overfitting describes the excessive adaptation of a model to a given data set. In statistics, overfitting means specifying a model that contains too many explanatory variables. If, conversely, relevant variables are disregarded (see omitted-variable bias), this is referred to as underfitting.

Mathematical definition

Given a hypothesis space H and a hypothesis h ∈ H, h is said to overfit the training data if there exists an alternative hypothesis h′ ∈ H such that h has a smaller error than h′ on the training data, but h′ has a smaller error than h with respect to the distribution of all instances.
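As a small illustration of this definition, the following sketch (hypothetical data and plain NumPy, not part of the original article) fits a complex hypothesis h and a simpler alternative h′ to a noisy sample; h achieves the smaller training error, while h′ typically has the smaller error on a large fresh sample that stands in for the distribution of all instances.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return 2.0 * x + 1.0                       # underlying relationship

# small, noisy training sample
x_train = rng.uniform(-1, 1, 10)
y_train = true_function(x_train) + rng.normal(0, 0.3, 10)

# large fresh sample, standing in for "the distribution of all instances"
x_all = rng.uniform(-1, 1, 100_000)
y_all = true_function(x_all) + rng.normal(0, 0.3, 100_000)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

h = np.polyfit(x_train, y_train, deg=8)        # complex hypothesis h
h_alt = np.polyfit(x_train, y_train, deg=1)    # simpler alternative hypothesis h'

print("training error:         h =", mse(h, x_train, y_train), " h' =", mse(h_alt, x_train, y_train))
print("error on all instances: h =", mse(h, x_all, y_all), " h' =", mse(h_alt, x_all, y_all))
# typically h has the smaller training error but h' the smaller error overall,
# so h overfits the training data in the sense of the definition above
```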

Statistics

In multiple regression, overfitting characterizes a model that contains additional, irrelevant regressors (explanatory variables). Conversely, a model characterized as underfitted lacks some or all of the relevant regressors.

Including additional regressors cannot decrease the coefficient of determination, which measures how well the model fits the sample data. Through random effects, even irrelevant regressors can appear to explain part of the variance and thus artificially increase the coefficient of determination.

Overfitting is to be assessed negatively, because the actual (lower) goodness of fit is concealed: the model fits the sample data better, but owing to its lack of generality it cannot be transferred to the population. Regression coefficients erroneously appear insignificant because their effects can no longer be estimated with sufficient precision. The estimators are inefficient, i.e. their variance is no longer minimal. At the same time, there is a growing risk that irrelevant variables appear statistically significant due to random effects. Overfitting thus worsens the estimation properties of the model, in particular because an increasing number of regressors reduces the number of degrees of freedom. Large differences between the uncorrected and the corrected (adjusted) coefficient of determination indicate overfitting. Overfitting can be counteracted above all by logical considerations and by applying a factor analysis.
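How irrelevant regressors affect the two measures can be seen in a small numerical sketch (synthetic data and an ordinary least-squares fit; all values are illustrative assumptions): adding purely random regressors never lowers R², while the adjusted R² penalises the lost degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x_relevant = rng.normal(size=n)
y = 1.5 * x_relevant + rng.normal(size=n)        # only one relevant regressor

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])    # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)  # k = number of regressors

X = x_relevant.reshape(-1, 1)
for extra in range(0, 16, 5):
    noise = rng.normal(size=(n, extra))          # purely random, irrelevant regressors
    X_ext = np.column_stack([X, noise]) if extra else X
    k = X_ext.shape[1]
    r2 = r_squared(X_ext, y)
    print(f"{k:2d} regressors: R² = {r2:.3f}, adjusted R² = {adjusted_r_squared(r2, n, k):.3f}")
```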

Data sets and overfitted models

First of all, the choice of the data set, in particular the number of observations, measurement points or samples, is an essential criterion for reliable and successful modeling. Otherwise, the assumptions derived from these data do not allow any conclusions to be drawn about reality. This applies in particular to statistical statements.

The maximum possible complexity of the model (without overfitting) is, for a given signal-to-noise ratio, proportional to the representativeness of the training set and thus also to its size. This also gives rise to an interdependence with the bias in finite samples (finite sample bias), so that training data should be collected as comprehensively and extensively as possible.

In other words: if you want to search existing data for rules or trends, you have to choose suitable data. If you want to make a statement about the most common letters in the German language, you should not look at just a single sentence, especially if the letter "E" happens to occur rarely in it.
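A toy sketch of this point (the sentences below are made-up examples rather than a representative corpus): letter frequencies counted from a single sentence can differ markedly, and misleadingly, from those of even a slightly larger sample.

```python
from collections import Counter

def letter_frequencies(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: round(n / total, 3) for c, n in counts.most_common(5)}

# a single, unrepresentative sentence (made-up example data)
small_sample = "Franz jagt im komplett verwahrlosten Taxi quer durch Bayern"
# a larger (still toy) sample; in practice one would use a large corpus
large_sample = small_sample + " " + (
    "Der Beispieltext enthaelt deutlich mehr Buchstaben und naehert sich "
    "dadurch eher den tatsaechlichen Haeufigkeiten der deutschen Sprache an"
)

print("small sample: ", letter_frequencies(small_sample))
print("larger sample:", letter_frequencies(large_sample))
# the single sentence suggests misleading frequencies; only a sufficiently
# large and representative sample supports conclusions about the language
```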

Overfitting from too much training

In the case of computer-aided modeling, there is a second effect. A data model is fitted to existing training data over several training steps. For example, with a few dozen handwriting samples, a computer can be trained to correctly recognize and assign handwritten digits (0–9). The aim is to recognize the handwriting of people whose writing samples were not part of the training set.

The following is often observed: the recognition performance for digits written by people outside the training set at first increases with the number of training steps. After a saturation phase, however, it decreases again, because the computer's data representation adapts too closely to the handwriting of the training data and no longer captures the underlying shapes of the digits to be learned. It is this process that originally coined the term overfitting, although the state of overfitting described above can have a number of causes.
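This behaviour can be reproduced in a small experiment, for example with scikit-learn's bundled digits data set, a deliberately small training set and a deliberately oversized network (all of these choices are illustrative assumptions, not the original experiment):

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)  # one epoch per fit() call below

X, y = load_digits(return_X_y=True)                  # handwritten digits 0-9
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
X_small, y_small = X_train[:200], y_train[:200]      # deliberately small training set

# oversized network relative to the amount of training data
model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=1, warm_start=True, random_state=0)

for epoch in range(1, 201):
    model.fit(X_small, y_small)                      # one further pass over the training data
    if epoch % 20 == 0:
        print(epoch,
              "train acc:", round(model.score(X_small, y_small), 3),
              "val acc:", round(model.score(X_val, y_val), 3))
# the training accuracy keeps approaching 1.0, while the accuracy on unseen
# writers saturates and may decline again as the network memorises the samples
```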

If the model is not intended to be used beyond the training set, i.e. if a model is only sought for one closed, complete problem, then of course there is no question of overfitting. An example would be a computer model for the complete, finite set of right-of-way situations in road traffic. Such models are considerably less complex than those described above, and most of the time the rules are already known, so that programs written by humans are usually more efficient than machine learning.

Cognitive analogy

An overfitted model may reproduce the training data correctly because it has, so to speak, “learned it by heart”. Generalization, which would amount to an intelligent classification, is then no longer possible. The “memory” of the model is so large that it does not have to learn any rules.

Strategies to Avoid Overfitting

As already mentioned, it is advantageous to aim for the smallest possible number of parameters in parametric models. In the case of non-parametric methods, it is likewise advisable to limit the number of degrees of freedom from the outset. In a multilayer perceptron, this would mean, for example, limiting the size of the hidden neuron layers (hidden layers). In complex cases, the number of necessary parameters or degrees of freedom can also be reduced by transforming the data before the actual classification or regression step. In particular, methods of dimension reduction may be useful here (principal component analysis, independent component analysis, etc.).
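The last point can be sketched, for example, with a principal component analysis placed in front of a simple classifier on scikit-learn's digits data set; the choice of 20 components is an arbitrary illustrative assumption:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# without dimension reduction: all 64 pixel features enter the classifier
plain = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# with dimension reduction: PCA keeps only the 20 leading components,
# so the classifier has far fewer parameters to estimate
reduced = make_pipeline(PCA(n_components=20),
                        LogisticRegression(max_iter=5000)).fit(X_train, y_train)

print("test accuracy without PCA:", round(plain.score(X_test, y_test), 3))
print("test accuracy with PCA:   ", round(reduced.score(X_test, y_test), 3))
```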

Overfitting in machine learning that depends on the training duration can also be prevented by early stopping. For this purpose, data sets are often not only split in two and assigned to a training and a validation set, but, for example, split three ways. The three subsets are then used, respectively and exclusively, for training, for “real-time” monitoring of the out-of-sample error (and possibly terminating training if it increases), and for the final determination of the test quality.
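A minimal sketch of such a three-way split with early stopping might look as follows (the classifier, the split ratios and the patience value are arbitrary illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# three-way split: training, validation (controls early stopping), test (final quality)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y)

best_val, best_epoch, patience, waited = 0.0, 0, 5, 0
for epoch in range(1, 201):
    model.partial_fit(X_train, y_train, classes=classes)  # one pass over the training set
    val_acc = model.score(X_val, y_val)                    # out-of-sample control
    if val_acc > best_val:
        best_val, best_epoch, waited = val_acc, epoch, 0
    else:
        waited += 1
        if waited >= patience:                             # validation score stopped improving
            break

# a fuller implementation would also restore the parameters of the best epoch
print(f"stopped after epoch {epoch} (best validation accuracy at epoch {best_epoch})")
print("final test accuracy:", round(model.score(X_test, y_test), 3))
```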

Noisy (approximately linear) data can be described by both a linear and a polynomial function. Although the polynomial function, unlike the linear one, passes through every data point, the linear function describes the course of the data better because it has no large deviations at the ends. If the regression curves were used to extrapolate the data, the overfitting would be even more pronounced.
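This can be reproduced with a few lines of NumPy (synthetic, approximately linear data; the polynomial degree is chosen so that the curve interpolates every point):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 8)
y = 3.0 * x + 0.5 + rng.normal(0, 0.2, x.size)   # noisy, approximately linear data

linear = np.polyfit(x, y, deg=1)                 # two parameters
poly = np.polyfit(x, y, deg=7)                   # interpolates every data point

x_new = np.array([1.2, 1.5])                     # extrapolation beyond the observed range
print("linear fit:    ", np.polyval(linear, x_new))
print("polynomial fit:", np.polyval(poly, x_new))
print("true values:   ", 3.0 * x_new + 0.5)
# the interpolating polynomial matches the training points exactly, but its
# extrapolated values typically deviate far more strongly than the linear fit's
```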

Literature

  • Michael Berthold, David J. Hand: Intelligent Data Analysis: An Introduction . Springer Verlag, 2003, ISBN 3-540-43060-1
  • Tom M. Mitchell: Machine Learning . McGraw-Hill Companies, Inc., 1997, ISBN 0-07-115467-1

References

  1. Backhaus, K., Erichson, B., Plinke, W., Weiber, R.: Multivariate Analysis Methods. An Application-Oriented Introduction. Berlin et al., 11th edition 2006, pp. 84–85.
  2. Backhaus, K., Erichson, B., Plinke, W., Weiber, R.: Multivariate Analysis Methods. An Application-Oriented Introduction. Berlin et al., 11th edition 2006, p. 85.
  3. Backhaus, K., Erichson, B., Plinke, W., Weiber, R.: Multivariate Analysis Methods. An Application-Oriented Introduction. Berlin et al., 11th edition 2006, p. 68.