Kolmogorov-Smirnov test

from Wikipedia, the free encyclopedia

The Kolmogorov-Smirnov test (KS test) (after Andrey Nikolayevich Kolmogorov and Nikolai Vasilyevich Smirnov) is a statistical test of the agreement of two probability distributions.

With its help, random samples can be used to check whether

  • two random variables have the same distribution, or
  • a random variable follows a previously assumed probability distribution.

In the context of the latter (one-sample) application, one speaks of the Kolmogorov-Smirnov goodness-of-fit test (KSA test). Some (parametric) statistical methods assume that the variables under examination are normally distributed in the population. The KSA test can be used to test whether this assumption must be rejected or whether it can be retained (taking into account the type I error).

Concept

Illustration of the Kolmogorov-Smirnov test: the red line is the hypothesized distribution function, the blue line is the empirical distribution function, and the black arrow marks the Kolmogorov-Smirnov statistic.

The concept is explained here using the goodness-of-fit test; the comparison of two characteristics proceeds analogously. One considers a statistical characteristic X whose distribution in the population is unknown. The two-sided hypotheses are then:

Null hypothesis H_0: F(x) = F_0(x) for all x

(The random variable X has the probability distribution F_0.)

Alternative hypothesis H_1: F(x) ≠ F_0(x) for at least one x

(The random variable X has a different probability distribution than F_0.)

The Kolmogorov-Smirnov test compares the empirical distribution function F_n with F_0 by means of the test statistic

d_n = sup_x | F_n(x) - F_0(x) | ,

where sup denotes the supremum.

According to the Glivenko-Cantelli theorem, the empirical distribution function F_n converges uniformly to the distribution function of X (that is, under H_0, to F_0). Under H_1, d_n should therefore be larger than under H_0. Under H_0, the distribution of the test statistic does not depend on the hypothesized distribution F_0. If the value of the test statistic is greater than the corresponding tabulated critical value, the null hypothesis is rejected.

Procedure for the one-sample problem (goodness-of-fit test)

Suppose n observed values x_1, …, x_n of a real random variable X are available, already sorted in ascending order: x_1 ≤ x_2 ≤ … ≤ x_n. From these observations, the relative cumulative frequency function (cumulative frequency, empirical distribution function) S(x_j) is determined. This empirical distribution is now compared with the corresponding hypothesized distribution of the population: the value of the probability distribution F_0 at the point x_j is determined: F_0(x_j). If X actually obeys this distribution, the observed frequency S(x_j) and the expected frequency F_0(x_j) should be roughly equal.

If F_0 is continuous, the test statistic can be calculated in the following way: for each j = 1, …, n, the absolute differences

d_o(x_j) = | S(x_j) - F_0(x_j) |

and

d_u(x_j) = | S(x_{j-1}) - F_0(x_j) |

are calculated ("o" for above, "u" for below), where S(x_0) := 0 is set. The largest absolute difference d_max among all the differences d_o(x_j), d_u(x_j) is then determined. If d_max exceeds a critical value d_α, the hypothesis H_0 is rejected at significance level α.

For small n, the critical values d_α are tabulated. For larger n, they can be approximated using the formula

d_α ≈ sqrt( ln(2/α) / (2n) ).

This approximation yields the large-n expressions in the following table:

α      d_α (large n)
0.20   1.07 / √n
0.10   1.22 / √n
0.05   1.36 / √n
0.02   1.52 / √n
0.01   1.63 / √n
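As a small sanity check of this approximation, a Python sketch (the helper name is invented here):

```python
import math

def ks_critical_value(n, alpha):
    """Approximate large-n critical value d_alpha = sqrt(ln(2/alpha) / (2n))."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

# Reproduces the familiar 1.36 / sqrt(n) rule of thumb for alpha = 0.05
print(ks_critical_value(100, 0.05))   # ~0.136
```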

Procedure for the two-sample problem

If, in addition to the random variable X above, a second random variable Y is observed (with m values, likewise sorted in ascending order: y_1 ≤ y_2 ≤ … ≤ y_m), the two-sample test can be used to check whether X and Y follow the same distribution function. The hypotheses are:

Null hypothesis H_0: F_X = F_Y

(The random variables X and Y have the same probability distribution.)

Alternative hypothesis H_1: F_X ≠ F_Y

(The random variable X has a different probability distribution than Y.)

Analogously to the one-sample test, the Kolmogorov-Smirnov test compares the empirical distribution functions (relative cumulative frequency functions) S_X and S_Y on the basis of their absolute differences, by means of the test statistic

d_{n,m} = max_x | S_X(x) - S_Y(x) | .

The null hypothesis H_0 is rejected at significance level α if d_{n,m} exceeds the critical value d_α. The critical values are tabulated for small values of n and m. For large values of n and m, the null hypothesis is rejected if

d_{n,m} > sqrt( (n + m) / (n · m) ) · K_α ,

where K_α for large n and m can be calculated approximately as K_α ≈ sqrt( ln(2/α) / 2 ).
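SciPy implements exactly this comparison of two empirical distribution functions in scipy.stats.ks_2samp; a minimal sketch with invented sample data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=80)   # sample from N(0, 1)
y = rng.normal(loc=0.5, scale=1.0, size=60)   # sample from a shifted normal

result = ks_2samp(x, y)
print(result.statistic, result.pvalue)        # reject H0 if pvalue < alpha
```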

Numerical example

Comparison of the empirical and theoretical distribution for the numerical example: on the left, a histogram with normal density curve; on the right, the theoretical and the empirical distribution function.

In a company that produces high-quality perfumes, the quantities filled into the bottles were measured on a filling plant as part of quality assurance. The characteristic under study is X: filled quantity in ml.

It is to be checked whether the known parameters of the distribution of X still apply.

First of all, at a given significance level α, it is to be tested whether the characteristic X is normally distributed in the population at all with the known parameters μ and σ, that is

H_0: X ~ N(μ; σ²)

with N as the normal distribution symbol. The following table results:

Here x_j denotes the j-th observation, S(x_j) the value of the empirical cumulative frequency function at the j-th observation, and F_0(x_j) the value of the normal distribution function at the point x_j with the parameters mentioned. The next columns show the differences d_o and d_u defined above. The critical value for this n and α that would lead to rejection is tabulated. The largest absolute deviation in the table occurs in the 3rd row; this value is greater than the critical value, so the hypothesis H_0 is rejected. It can therefore be assumed that the distribution hypothesis is wrong. This can mean that the filled quantity is no longer normally distributed, that the average filled quantity has shifted, or that the variance of the filled quantity has changed.
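Since the measurement table is not reproduced above, the following Python sketch only illustrates the form of such a check with SciPy; the filling quantities and the parameters μ and σ are hypothetical stand-ins, not the values from the example:

```python
from scipy.stats import kstest, norm

# Hypothetical filling quantities in ml and assumed known parameters
# (NOT the values from the article's table)
fill_ml = [9.41, 9.92, 11.55, 11.60, 11.73, 12.00, 12.06, 13.02]
mu, sigma = 11.0, 1.0

result = kstest(fill_ml, norm(loc=mu, scale=sigma).cdf)
print(result.statistic, result.pvalue)   # compare the statistic with d_alpha
```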

Characteristics of the KS test

In the case of the one-sample problem, the KS test, in contrast to the χ² goodness-of-fit test, is also suitable for small samples.

As a non-parametric test, the Kolmogorov-Smirnov test is very stable and robust. It was originally developed for continuously distributed metric characteristics; however, it can also be used for discrete and even rank-scaled characteristics. In these cases the test is somewhat less selective, i.e. the null hypothesis is rejected less often than in the continuous case.

A great advantage is that the underlying random variable does not have to follow a normal distribution. Under H_0, the distribution of the test statistic is identical for all (continuous) distributions F_0. This makes the test versatile, but it also brings a disadvantage: the KS test generally has low statistical power. The Lilliefors test is an adaptation of the Kolmogorov-Smirnov test for testing for normality with unknown mean and unknown variance. Possible alternatives to the KS test are the Cramér-von-Mises test, which is suitable for both use cases, and the Anderson-Darling test for comparing a sample with a hypothetical probability distribution.
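For the Lilliefors variant mentioned above, statsmodels provides an implementation; a minimal sketch, assuming statsmodels is installed:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=2.0, size=40)

# Tests for normality with mean and variance estimated from the data
stat, pvalue = lilliefors(sample, dist='norm')
print(stat, pvalue)
```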

Literature

  • Lothar Sachs, Jürgen Hedderich: Applied statistics. 12th, completely revised and expanded edition. Springer, Berlin / Heidelberg 2006, ISBN 978-3-540-32161-3 .
