Cross-validation procedure


Cross-validation procedures are test methods in statistics and data analysis that are used, for example, in data mining or when validating newly developed questionnaires. A distinction is made between simple cross-validation, stratified cross-validation, and leave-one-out cross-validation.

Problem

The linear interpolation polynomial (blue) as a model for the 10 observations (black) has a large error (underfitting). The quadratic polynomial (green) was the basis for generating the data. The interpolation polynomial of degree 9 (red) passes exactly through the data points but behaves very poorly between the observations (overfitting).

Various methods exist in statistics for obtaining a reliable value for the goodness of fit (quality) of a statistical model. As a rule, summary criteria are used for this, e.g. the (adjusted) coefficient of determination in linear regression, or the Akaike or Bayesian information criterion in models based on the maximum likelihood method. Such criteria are partly based on asymptotic theory, i.e. they can only be estimated reliably for large sample sizes; estimating them from small samples is therefore problematic. Often the exact number of estimated parameters that the criterion requires cannot even be calculated; nonparametric statistics is an example of this.
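For reference (standard definitions, not stated in the original text), with $\hat{L}$ the maximized likelihood, $k$ the number of estimated parameters, $p$ the number of regressors, and $n$ the sample size:

    \bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}, \qquad
    \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
    \mathrm{BIC} = k \ln n - 2\ln\hat{L}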

Furthermore, there is the problem that overly parameterized models tend to adapt too closely to the data. One example is polynomial interpolation: given $n$ observations $(x_1, y_1), \ldots, (x_n, y_n)$, one can determine an interpolation polynomial $p$ of degree $n-1$ such that $p(x_i) = y_i$ holds for all $i$. However, such a polynomial may approximate the data very poorly between the observation points (so-called overfitting). If one were to calculate the error on the observations themselves (in-sample error), one would overestimate the model quality.
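The effect can be illustrated with a short NumPy sketch (the data-generating quadratic and the noise level are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    true = lambda t: 1 + 2*t - 3*t**2                # quadratic ground truth
    y = true(x) + rng.normal(scale=0.2, size=x.size)

    grid = np.linspace(0, 1, 200)                    # points between the observations
    for degree in (1, 2, 9):
        coeffs = np.polyfit(x, y, degree)            # degree 9 interpolates all 10 points
        in_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        out_err = np.mean((np.polyval(coeffs, grid) - true(grid)) ** 2)
        print(degree, in_err, out_err)               # in-sample error misleads for degree 9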

To avoid the problems mentioned above, the data set is divided into two parts. The model parameters are estimated using only the first part, and the model error is then calculated on the second part (out-of-sample error). Cross-validation is the generalization of this procedure.
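A minimal sketch of this holdout procedure, with fit and predict as placeholders for an arbitrary model (hypothetical names, not from the article):

    import numpy as np

    def holdout_error(x, y, fit, predict, train_frac=0.7, seed=0):
        """Estimate the out-of-sample error from a single train/test split."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(x))
        n_train = int(train_frac * len(x))
        train, test = idx[:n_train], idx[n_train:]
        model = fit(x[train], y[train])               # estimate parameters on part 1
        return np.mean((y[test] - predict(model, x[test])) ** 2)  # error on part 2

For the polynomial example above, fit=lambda xs, ys: np.polyfit(xs, ys, 2) and predict=np.polyval would do.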

Simple cross-validation

The available data set, consisting of $N$ elements, is divided into $k$ subsets of as equal size as possible. Then $k$ test runs are started, in which the $i$-th subset is used as the test set and the remaining $k-1$ subsets are used as training sets. The overall error rate is calculated as the average of the individual error rates of the $k$ runs. This test method is called k-fold cross-validation.
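A sketch of the procedure in NumPy, reusing the placeholder fit/predict interface from above:

    import numpy as np

    def k_fold_cv(x, y, fit, predict, k=10, seed=0):
        """k-fold cross-validation: mean test error over k train/test splits."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), k)  # k subsets, near-equal size
        errors = []
        for i in range(k):
            test = folds[i]                                 # i-th subset as test set
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(x[train], y[train])
            errors.append(np.mean((y[test] - predict(model, x[test])) ** 2))
        return np.mean(errors)                              # average of the k error rates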

Stratified cross-validation

Building on simple k-fold cross-validation, k-fold stratified cross-validation ensures that each of the k subsets has approximately the same class distribution as the full data set. This reduces the variance of the estimate.
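One way to construct stratified folds is to split the indices of each class separately and deal them out over the folds, so that each fold mirrors the overall class proportions (a sketch; libraries such as scikit-learn provide this as StratifiedKFold):

    import numpy as np

    def stratified_folds(labels, k=10, seed=0):
        """Assign indices to k folds while preserving the class proportions."""
        rng = np.random.default_rng(seed)
        fold_of = np.empty(len(labels), dtype=int)
        for cls in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == cls))
            fold_of[idx] = np.arange(len(idx)) % k    # round-robin within each class
        return [np.flatnonzero(fold_of == i) for i in range(k)]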

Leave-one-out cross-validation

Leave-one-out cross-validation (LOO-CV) is the special case of k-fold cross-validation with k = N (N = number of elements). Thus N runs are started, and the overall error rate is the mean of their individual error values.
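In the k-fold sketch above this corresponds simply to setting k = N:

    loo_error = k_fold_cv(x, y, fit, predict, k=len(x))  # each test set holds one element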

A disadvantage of this method is that stratification of the subsets, as in stratified cross-validation, is no longer possible. In extreme cases this test procedure can therefore deliver grossly wrong error values. Example: applying LOO-CV to a completely random data set with two equally frequent classes and a classifier that always predicts the majority class of its training data yields an accuracy of about 0. In each run, N/2 elements of one class and N/2 - 1 elements of the other class (the class of the held-out element) are used for training, so the held-out element always belongs to the minority class of the training set. Since the classifier always predicts the training majority class, every prediction is wrong, and the estimated overall error rate is 100%, while the true error rate would be 50%.
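The example can be reproduced in a few lines, assuming such a majority-class classifier (a sketch):

    import numpy as np

    labels = np.repeat([0, 1], 50)              # balanced two-class data, no real signal
    wrong = 0
    for i in range(len(labels)):
        train = np.delete(labels, i)            # leave element i out
        majority = np.bincount(train).argmax()  # N/2 vs. N/2 - 1: majority != labels[i]
        wrong += majority != labels[i]
    print(wrong / len(labels))                  # 1.0: estimated error 100%, true error 50%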

Another disadvantage is that the large number of training runs leads to very high computational effort.

Application example

A psychologist is developing a new test to measure depression.

To check how well the test measures the trait in question (depression), a large group of people whose trait values are already known (determined beforehand by experts or by another test) is asked to take this test as a first step.

In the next step, he divides the large group into two randomly composed subgroups (or k subsets, see above); call them subgroup A and subgroup B. The psychologist now uses the data from subgroup A to create a prediction equation for the trait the test is intended to measure. In other words, he forms a rule by which a person's trait value can be inferred from their test data. He then applies this prediction equation to all members of subgroup B, inferring their respective trait values from their test data using the equation developed on subgroup A, and compares the predicted values with the actual ones. The test is validated crosswise, hence "cross-validation". The higher the agreement between actual and predicted values, the better, i.e. the more valid, the test.

In a third possible step, the psychologist repeats the procedure with the subgroups swapped: he develops a prediction equation from the data of subgroup B and checks it on subgroup A (double cross-validation).
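As a sketch with made-up numbers and a simple linear prediction equation (purely illustrative, not part of the original example):

    import numpy as np

    rng = np.random.default_rng(0)
    score = rng.normal(size=100)                             # test scores
    trait = 0.8 * score + rng.normal(scale=0.5, size=100)    # known trait values
    A, B = np.arange(50), np.arange(50, 100)                 # two random halves

    def validate(train, test):
        slope, intercept = np.polyfit(score[train], trait[train], 1)  # prediction equation
        predicted = slope * score[test] + intercept
        return np.corrcoef(predicted, trait[test])[0, 1]     # agreement with actual values

    print(validate(A, B))   # equation from subgroup A, checked on subgroup B
    print(validate(B, A))   # swapped subgroups: double cross-validation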
