Statistical test

from Wikipedia, the free encyclopedia

In test theory, a branch of mathematical statistics, a statistical test is used to make a well-founded decision about the validity or invalidity of a hypothesis on the basis of existing observations. Formally, a test is a mathematical function that assigns a decision to an observation result. Since the available data are realizations of random variables, in most cases it cannot be said with certainty whether a hypothesis is true or not. One therefore tries to control the probabilities of wrong decisions, which corresponds to a test at a given significance level. For this reason, one also speaks of a hypothesis test or a significance test.

Interpretation of a statistical test

In principle, a statistical test procedure can be compared to a legal procedure. The purpose of the trial is (mostly) to determine whether there is sufficient evidence to convict the accused. It is always assumed that a suspect is innocent, and as long as there is great doubt about the evidence of an actual crime, a defendant will be acquitted. A conviction will only be reached if the indications of a defendant's guilt clearly predominate.

Accordingly, at the beginning of the procedure there are two hypotheses: “the suspect is innocent” and “the suspect is guilty”. The former is called the null hypothesis and is provisionally assumed. The latter is called the alternative hypothesis. It is the one that one is trying to “prove”.

In order not to convict an innocent person too easily, the hypothesis of innocence is rejected only when an error is very unlikely. One also speaks of controlling the probability of a type I error (i.e. convicting an innocent person). Naturally, this asymmetrical procedure makes the probability of a type II error (i.e. acquitting a guilty party) “high”. Due to the stochastic structure of the test problem, wrong decisions cannot be ruled out, just as in court proceedings. In statistics, however, one tries to construct optimal tests that minimize the probability of errors.

An introductory example

An attempt should be made to develop a test of clairvoyant skills.

A test person is shown the back of a randomly chosen playing card 25 times and is asked each time which of the four suits (clubs, spades, hearts, diamonds) the card belongs to. We denote the number of hits by $X$.

Since the clairvoyant abilities of the person are to be tested, we initially assume the null hypothesis that the test person is not clairvoyant. The alternative hypothesis is accordingly: the test person is clairvoyant.

What does that mean for our test? If the null hypothesis is correct, the test person can only try to guess the suit. With four suits, the chance of guessing the correct suit is 1/4 for each card. If the alternative hypothesis is correct, the person has a probability greater than 1/4 for each card. We denote the probability of a correct prediction by $p$.

The hypotheses are then:

$H_0\colon p = \tfrac{1}{4}$ and $H_1\colon p > \tfrac{1}{4}$.

If the test person names all 25 cards correctly, we will consider them clairvoyant and of course reject the null hypothesis. The same holds for 24 or 23 hits. On the other hand, if there are only 5 or 6 hits, there is no reason to do so. But what about 12 hits? What about 17 hits? Where is the critical number of hits beyond which we can no longer believe that they were pure chance hits?

So how do we determine the critical value $c$, that is, the number of hits $X$ from which we reject the null hypothesis? With $c = 25$ (that means that we only want to recognize clairvoyant abilities if all cards have been recognized correctly) one is clearly more critical than with $c = 10$. In the first case hardly anyone will be regarded as a clairvoyant, in the second case far more people will.

In practice, it depends on how critical one wants to be, i.e. how often one allows a wrong decision of the first kind (type I error). With $c = 25$, the probability of such a wrong decision, i.e. the probability that a non-clairvoyant test person guesses correctly 25 times purely by chance, is

$P(X \in A \mid p = \tfrac{1}{4}) = P(X \ge 25 \mid p = \tfrac{1}{4}) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15}$,

so very small. Here $A$ denotes the rejection region. We assume that the test statistic of the test is $T = X$ and reject $H_0$ if $X \ge c$.

Less critical, with $c = 10$, we get with the binomial distribution

$P(X \ge 10 \mid p = \tfrac{1}{4}) = \sum_{k=10}^{25} \binom{25}{k} \left(\tfrac{1}{4}\right)^{k} \left(\tfrac{3}{4}\right)^{25-k} \approx 0.07$,

a much greater probability.

Before the test, a bound for the probability of a type I error is fixed. Values between 1% and 5% are typical. Depending on this, $c$ can be determined (here for a significance level of 1%) such that

$P(X \ge c \mid p = \tfrac{1}{4}) \le 0.01$

holds. Among all numbers $c$ with this property, one finally chooses the smallest one in order to keep the probability of a type II error small. In this specific example it follows that $c = 13$. A test of this type is called a binomial test, since the number of hits is binomially distributed under the null hypothesis.
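
The search for the critical value in the card example can be retraced numerically. The following Python sketch uses only the standard library; the helper names `binom_tail` and `critical_value` are illustrative choices and not part of the original example.

```python
from math import comb


def binom_tail(c: int, n: int, p: float) -> float:
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(n: int, p0: float, alpha: float) -> int:
    """Return the smallest c with P(X >= c | p = p0) <= alpha."""
    for c in range(n + 2):  # c = n + 1 corresponds to an empty rejection region
        if binom_tail(c, n, p0) <= alpha:
            return c
    raise AssertionError("unreachable: the empty region always has level 0")


n, p0 = 25, 0.25
print(binom_tail(25, n, p0))               # only 25 hits count: about 1e-15
print(binom_tail(10, n, p0))               # 10 or more hits count: about 0.07
print(critical_value(n, p0, alpha=0.01))   # smallest admissible c: 13
```

Because the loop starts at the smallest candidate, the returned value is the least critical value that still respects the level, which is exactly the choice that keeps the type II error probability as small as possible.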

Possible wrong decisions

Even if it is desirable for the test to make “correct” decisions based on the available data, wrong decisions remain possible. In the mathematical model, this means that if the null hypothesis is correct and a decision has been made in favor of the alternative, a type I error (α error) has been committed. If the null hypothesis is retained even though it is not correct, a type II error (β error) is committed.

In statistical practice, this ostensibly symmetrical problem is turned into an asymmetrical one: a significance level α is fixed that provides an upper bound for the probability of a type I error. Tests with this property are called tests of level α. Subsequently, one tries to obtain an optimal test for the given level by looking, among all tests of level α, for one that has the lowest probability of a type II error.
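
The trade-off between the two error probabilities can be made concrete with the card-guessing setting (25 cards, guessing probability 1/4). The sketch below is only an illustration: the alternative `p1 = 0.5` is an assumed value, since the type II error probability can only be computed for a specific alternative, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def error_probabilities(c: int, n: int = 25, p0: float = 0.25, p1: float = 0.5):
    """Return (alpha, beta) for the rule 'reject H0 if X >= c'.

    p1 is an assumed alternative used only for illustration.
    """
    alpha = sum(binom_pmf(k, n, p0) for k in range(c, n + 1))  # reject although p = p0
    beta = sum(binom_pmf(k, n, p1) for k in range(c))          # retain although p = p1
    return alpha, beta


for c in (10, 13, 16):
    alpha, beta = error_probabilities(c)
    print(f"c = {c}: type I = {alpha:.4f}, type II = {beta:.4f}")
```

Raising the critical value lowers the type I error probability and raises the type II error probability, which is why the level is fixed first and the type II error is minimized afterwards.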

The formal approach

The general procedure for using a test is as follows:

  1. Formulation of a null hypothesis $H_0$ and its alternative hypothesis $H_1$
  2. Choice of the appropriate test (test variable or test statistic $T$)
  3. Determination of the critical region $K$ for the significance level $\alpha$, which must be fixed before the sample is realized. The critical region is formed from the values of the test statistic that occur only with low probability under the null hypothesis.
  4. Calculation of the observed value $t_{\text{obs}}$ of the test variable $T$ from the sample (depending on the test method, for example the $t$ value, $U$, $H$ or $\chi^2$).
  5. Making the test decision:
    • If $t_{\text{obs}}$ is not in $K$, $H_0$ is retained.
    • If $t_{\text{obs}}$ is in $K$, $H_0$ is rejected in favor of $H_1$.
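
The five steps can be walked through in a few lines of Python. The sketch below assumes a one-sided one-sample Gaussian test with known standard deviation; the data, `mu0`, `sigma` and `alpha` are made-up illustrative values, and `z_statistic` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step 1: H0: mu = mu0 against H1: mu > mu0 (illustrative one-sided problem)
mu0, sigma = 100.0, 15.0   # sigma is assumed to be known
alpha = 0.05               # fixed before the data are inspected


# Step 2: test statistic of the one-sample Gaussian test
def z_statistic(sample, mu0, sigma):
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))


# Step 3: critical region K = [z_crit, infinity)
z_crit = NormalDist().inv_cdf(1 - alpha)

# Step 4: observed value of the test statistic (made-up data)
sample = [112, 98, 105, 120, 101, 109, 117, 95, 108, 125]
t_obs = z_statistic(sample, mu0, sigma)

# Step 5: test decision
reject_h0 = t_obs >= z_crit
print(f"t_obs = {t_obs:.3f}, z_crit = {z_crit:.3f}, reject H0: {reject_h0}")
```

The critical region is computed before the observed value, mirroring the requirement that the level be fixed before the sample is realized.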

Formal definition of a statistical test

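Let $X$ be a random variable that maps from a probability space $(\Omega, \mathcal{A}, P)$ into a measurable space $(\mathcal{X}, \mathcal{B})$. Suppose in addition a parametric distributional assumption, that is, a family of probability measures $\{P_\theta \mid \theta \in \Theta\}$ on $(\mathcal{X}, \mathcal{B})$, with a bijection between this family and $\Theta$. Here $P_\theta$ is the distribution of $X$, and $\Theta$ is the parameter space, which in practice is usually a subset of $\mathbb{R}^d$ with $d \ge 1$. Two disjoint subsets $\Theta_0$ and $\Theta_1$ of $\Theta$ define the test problem:

  • $H_0\colon \theta \in \Theta_0$ versus $H_1\colon \theta \in \Theta_1$,

where $H_0$ denotes the null hypothesis and $H_1$ the alternative hypothesis. Often, but not necessarily, the two sets $\Theta_0$ and $\Theta_1$ form a partition of $\Theta$.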

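A measurable function $\varphi\colon \mathcal{X} \to \{0, 1\}$ on the space $\mathcal{X}$ of observation results is called a test. This test function is interpreted as follows:

  • $\varphi(x) = 1$: reject the null hypothesis
  • $\varphi(x) = 0$: retain the null hypothesis

The set $K = \{x \in \mathcal{X} \mid \varphi(x) = 1\}$ of observation results that lead to a rejection of $H_0$ is called the critical region of the test.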

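Now let $\alpha \in (0, 1)$ be a significance level. Then a test $\varphi$ with critical region $K$ is called a test of level $\alpha$ for the test problem $H_0$ against $H_1$ (also level-$\alpha$ test) if for all $\theta \in \Theta_0$

$P_\theta(X \in K) \le \alpha$.

Alternatively, $\alpha$ is also referred to as the size of the test.

As a rule, one looks for a test $\varphi$ whose critical region $K$ satisfies, on the one hand, the condition $P_\theta(X \in K) \le \alpha$ for all $\theta \in \Theta_0$ and, on the other hand, for all $\theta_1 \in \Theta_1$ and every critical region $K'$ that also satisfies the level condition, the optimality condition

$P_{\theta_1}(X \in K) \ge P_{\theta_1}(X \in K')$.

Most often $X$ is an $n$-dimensional random variable with values in $\mathcal{X} \subseteq \mathbb{R}^n$, where $n$ denotes the sample size. The formal definition and the practical implementation of a test are often based on a one-dimensional real-valued test statistic $T = T(X)$.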

Asymptotic behavior of the test

In most cases, the exact probability distribution of the test statistic under the null hypothesis is not known. One is then faced with the problem that no critical region can be determined for the given level. In these cases, the class of admissible tests is expanded to include those that have the correct level asymptotically. Formally, this means that one chooses the region $K$ so that for all $\theta \in \Theta_0$ the condition

$\limsup_{n \to \infty} P_\theta(X \in K) \le \alpha$

is satisfied. As a rule, such asymptotic tests are obtained via normal approximation; one tries to transform the test statistic in such a way that it converges to a normal distribution.

Simple examples of this are the one-sample and two-sample t-tests for expected values. Here the asymptotic distribution follows directly from the central limit theorem applied to the arithmetic mean. In addition, there are a number of other statistical methods that allow the derivation of an asymptotic normal distribution even for more complicated functionals. These include the delta method for nonlinear, differentiable transformations of asymptotically normally distributed random variables:

Let $g\colon \mathbb{R}^p \to \mathbb{R}^k$ be a differentiable function and let $\hat\theta$ be an estimator that is $\sqrt{n}$-asymptotically normal with asymptotic covariance matrix $\Sigma$, i.e. $\sqrt{n}\,(\hat\theta - \theta) \to N(0, \Sigma)$ in distribution. Then $\sqrt{n}\,(g(\hat\theta) - g(\theta)) \to N(0, D \Sigma D^\top)$ in distribution, where $D$ is the Jacobian matrix of $g$ at $\theta$.

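Furthermore, the nonparametric delta method (also: influence function method) has brought some progress:

Let $T(F)$ be a functional that depends on the distribution $F$. Let $L(x)$ be the Gâteaux derivative of the statistic at $F$ (the influence function) and let $T$ be Hadamard differentiable at $F$. Then $\sqrt{n}\,(T(\hat F_n) - T(F)) \to N\!\left(0, \int L(x)^2 \, dF(x)\right)$ in distribution, where $\hat F_n$ denotes the empirical distribution function.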

The delta method allows normal approximations for nonlinear, differentiable transformations of (asymptotically) normally distributed random variables, while the influence function method allows such approximations for many interesting characteristics of a distribution. These include, among others, the moments (for example variance, kurtosis, etc.), but also functions of these moments (for example the correlation coefficient).
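
The quality of the delta-method approximation can be checked in a simple one-dimensional case. The sketch below assumes a binomial proportion as the estimator and the transformation $g(x) = x^2$, both chosen only for illustration; it compares the exact variance of the transformed estimator, obtained by enumerating the binomial distribution, with the delta-method value $g'(p)^2 \, p(1-p)/n$.

```python
from math import comb


def exact_variance_of_g(n, p, g):
    """Exact variance of g(X / n) for X ~ Binomial(n, p), by enumeration."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    values = [g(k / n) for k in range(n + 1)]
    m = sum(w * v for w, v in zip(pmf, values))
    return sum(w * (v - m) ** 2 for w, v in zip(pmf, values))


def delta_variance(n, p, dg):
    """Delta-method approximation g'(p)^2 * p * (1 - p) / n."""
    return dg(p) ** 2 * p * (1 - p) / n


n, p = 400, 0.3                                   # illustrative values
exact = exact_variance_of_g(n, p, lambda x: x**2)
approx = delta_variance(n, p, lambda x: 2 * x)
print(f"exact = {exact:.3e}, delta method = {approx:.3e}")
```

For a sample size of this order the two values agree to within about one percent; the remaining gap comes from higher-order terms that the first-order expansion ignores.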

Another important requirement of a good test is that it becomes more sensitive as the sample size increases. In statistical terms, this means that if a consistent test statistic is present, the probability increases that the null hypothesis will actually be rejected in favor of the alternative hypothesis if it is incorrect. Especially if the difference between the actual behavior of the random variable and the hypothesis is very small, it will only be discovered with a correspondingly large sample size. Whether these deviations are of practical importance and even justify the effort of a large sample depends on the aspect to be examined.
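
How the sensitivity grows with the sample size can be illustrated with the exact one-sided binomial test. The sketch assumes a null value of 0.25, a nearby true value of 0.30 and a level of 1%; these numbers and the function names are illustrative assumptions.

```python
from math import comb


def binom_tail(c, n, p):
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def power_of_level_alpha_test(n, p0, p1, alpha):
    """Power at p1 of the exact one-sided binomial test of H0: p = p0."""
    # smallest critical value whose rejection probability under H0 is <= alpha
    c = next(c for c in range(n + 2) if binom_tail(c, n, p0) <= alpha)
    return binom_tail(c, n, p1)


powers = {n: power_of_level_alpha_test(n, 0.25, 0.30, 0.01) for n in (25, 100, 400)}
for n, power in powers.items():
    print(f"n = {n:3d}: probability of rejecting H0 when p = 0.30 is {power:.3f}")
```

A small deviation such as 0.30 against 0.25 is detected only rarely with 25 observations and considerably more often with several hundred, in line with the remark that small differences require correspondingly large samples.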

Model choice problem

Most mathematical results are based on assumptions made about certain properties of the random variables observed. Depending on the situation, different test statistics are selected, the (asymptotic) properties of which essentially depend on the requirements for the underlying distribution family. As a rule, these model assumptions have to be empirically checked beforehand in order to be used at all. It is particularly critical that the typical test procedures are subject to strict requirements that are seldom met in practice.

Types and characteristics of tests

Parametric and non-parametric tests

Parametric tests (parametric test method)

In parametric tests, specific values such as the variance or the mean are of interest. A parametric test method therefore makes statements about population parameters or about the constants occurring in the distribution function of the variable under study. To do this, all parameters of the population must be known (which is often not the case). In a parametric test, each of the conceivable samples has the same chance of being realized. Parametric tests assume that the observed sample data come from a population in which the variables or characteristics have a certain scale level and probability distribution, often interval scale level and normal distribution. In these cases one is interested in testing hypotheses about certain parameters of the distribution.

If the distributional assumptions made are incorrect, the results of the test are useless in most cases. In particular, the probability of a type II error can no longer be sensibly minimized. One then says that the power (selectivity) of the test decreases for many alternatives.

Non-parametric tests

In the case of non-parametric tests (also called parameter-free tests or distribution tests), the type of the distribution is checked: a decision is made as to whether observations or frequency distributions drawn from a random sample are compatible with a null hypothesis about the distribution in the population. Non-parametric tests can therefore make do with different assumptions; the set of distributions permitted for hypothesis and alternative cannot be described by a parameter.

However, since parametric tests often offer better power (selectivity) than non-parametric tests even when their assumptions are violated, the latter are rarely used.

Decision-making scheme for parametric / non-parametric test

In principle, a parametric test is preferred to a non-parametric alternative. A parametric test uses more information than a nonparametric test, which increases the quality of the test (assuming the additional information is correct). The following algorithm (in pseudocode ) can be used to select a parametric test or a non-parametric alternative. If STOP is reached, the algorithm is terminated.

  1. Is the variable not cardinally scaled?
    1. If so, then test non-parametrically. STOP.
  2. Perform a graphical review of the requirements. Are the test requirements clearly violated?
    1. If so, then check whether the violation can be remedied with a variable transformation. If a corresponding transformation does not make sense, then test nonparametrically. STOP.
  3. Are test biases to be expected due to the sample characteristics?
    1. If so, then test non-parametrically. STOP.
  4. Otherwise test parametrically. Is the alternative hypothesis accepted?
    1. If so, then accept the alternative hypothesis. STOP.
  5. Verification of the requirements of the test by means of appropriate tests. Has at least one requirement not been met?
    1. If so, then keep the null hypothesis. STOP.
  6. Additionally test non-parametrically. Is the result of the parametric test confirmed?
    1. If so, then keep the null hypothesis. STOP.
  7. The alternative hypothesis is accepted. STOP.
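
The scheme can be transcribed almost literally into a function. This is only a sketch: the boolean arguments are hypothetical stand-ins for judgments and test results that an analyst would have to supply, and in practice the later ones would be evaluated only when the corresponding step is reached.

```python
def decide(cardinal_scale: bool,
           requirements_clearly_violated: bool,
           transformation_fixes_violation: bool,
           bias_expected: bool,
           parametric_test_rejects_h0: bool,
           formal_checks_fail: bool,
           nonparametric_confirms: bool) -> str:
    """Literal transcription of steps 1 to 7 of the decision scheme."""
    # Step 1: variable not cardinally scaled
    if not cardinal_scale:
        return "test non-parametrically"
    # Step 2: graphical review; a violation that no transformation can remedy
    if requirements_clearly_violated and not transformation_fixes_violation:
        return "test non-parametrically"
    # Step 3: bias expected because of the sample characteristics
    if bias_expected:
        return "test non-parametrically"
    # Step 4: parametric test leads to acceptance of the alternative
    if parametric_test_rejects_h0:
        return "accept H1"
    # Step 5: formal checks show that at least one requirement is not met
    if formal_checks_fail:
        return "keep H0"
    # Step 6: the additional non-parametric test confirms the parametric result
    if nonparametric_confirms:
        return "keep H0"
    # Step 7
    return "accept H1"


print(decide(True, False, False, False, False, False, True))  # keep H0
```

Each early `return` corresponds to a STOP in the list, so the order of the checks is part of the procedure and should not be rearranged.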

Distribution-free and distribution-bound tests

For distribution-bound (parametric) tests, the test statistic depends on the distribution of the sample variables, that is, on their distribution in the population. A normal distribution is often assumed. An example of a distribution-bound test is the F-test for comparing the variances of two normally distributed populations.

In distribution-free tests, also known as non-parametric or parameter-free tests, the test statistic does not depend on the distribution of the sample variables. An example of a distribution-free test is the Levene test for comparing the variances of two arbitrarily distributed populations.

Conservative test

In a conservative test, the probability of a type I error (acceptance of the alternative hypothesis as the result of the test decision although the null hypothesis is true) is lower than the specified significance level $\alpha$ for every sample. The consequence is that the non-rejection region of the null hypothesis is wider than actually necessary. This means that the null hypothesis is rejected less often than the significance level $\alpha$ allows. One behaves conservatively and favors retaining the null hypothesis.

An example of a conservative test is the binomial test (test for a proportion, e.g. $H_0\colon p \le p_0$ vs. $H_1\colon p > p_0$). Due to the discreteness of the test statistic $T$, one cannot in general achieve that the critical value $c$ satisfies $P(T \ge c) = \alpha$. Instead, one demands $P(T \ge c) \le \alpha$. In general, the critical value chosen is the one that leads to a significance level of at most $\alpha$. In practice, the attained level can therefore be well below the specified level.
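
How far the attained level can fall below the nominal one is easy to check numerically. A minimal sketch, assuming a one-sided binomial test with null value 0.25 and nominal level 5% (both chosen only for illustration; the function names are hypothetical):

```python
from math import comb


def binom_tail(c, n, p):
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def attained_level(n, p0, alpha):
    """Return (c, P(X >= c | p = p0)) for the smallest admissible c."""
    for c in range(n + 2):  # c = n + 1 always satisfies the condition
        level = binom_tail(c, n, p0)
        if level <= alpha:
            return c, level


alpha = 0.05
for n in (10, 25, 50):
    c, level = attained_level(n, 0.25, alpha)
    print(f"n = {n}: c = {c}, attained level = {level:.4f} (nominal {alpha})")
```

Because the test statistic can only take integer values, the attained level jumps from above the nominal level to below it when the critical value increases by one, and the test has to settle for the lower value.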

Exact test

In some tests, the distribution of the test statistic is approximated by a different distribution, usually for easier calculation. If, on the other hand, the exact sampling distribution is used, one speaks of an exact test. Exact tests are, for example, Fisher's exact test or the binomial test.

An example here is the binomial test (test for a proportion, e.g. $H_0\colon p \le p_0$ vs. $H_1\colon p > p_0$). By the central limit theorem, the binomially distributed test statistic can be approximated by the normal distribution, for example if a rule of thumb such as $n p (1 - p) \ge 9$ holds. In this case, it may be necessary to use a continuity correction for a better approximation.
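
The effect of the continuity correction can be seen by comparing an exact binomial tail probability with its two normal approximations. The values of `n`, `p` and `c` below are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, c = 100, 0.25, 30                    # illustrative values; n*p*(1-p) = 18.75
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mean and standard deviation of X

# exact tail probability P(X >= c) of the binomial distribution
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

# normal approximation without and with continuity correction
plain = 1 - NormalDist(mu, sigma).cdf(c)
corrected = 1 - NormalDist(mu, sigma).cdf(c - 0.5)

print(f"exact = {exact:.4f}, plain = {plain:.4f}, corrected = {corrected:.4f}")
```

The correction shifts the cut-off by half a unit so that the whole probability mass of the integer `c` is included, which brings the approximation closer to the exact value in this setting.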

One-sided and two-sided tests

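In the case of a one-dimensional parameter $\theta$ with values in the parameter space $\Theta \subseteq \mathbb{R}$, one speaks of a one-sided alternative hypothesis in the two cases $H_1\colon \theta < \theta_0$ and $H_1\colon \theta > \theta_0$, and of a two-sided alternative hypothesis in the case $H_1\colon \theta \ne \theta_0$. Here $\theta_0$ is a specified parameter in $\Theta$. In the first case, the null hypothesis can be of the form $H_0\colon \theta = \theta_0$ or $H_0\colon \theta \ge \theta_0$; in the second case, it can be of the form $H_0\colon \theta = \theta_0$ or $H_0\colon \theta \le \theta_0$; in the third case, the null hypothesis is $H_0\colon \theta = \theta_0$. In this context, one speaks of one-sided and two-sided test problems or, more briefly, of one-sided and two-sided tests.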
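
The three kinds of alternative lead to three different critical regions. The sketch below assumes a test statistic `z` that is standard normally distributed at the boundary of the null hypothesis (as in a Gaussian test); the function name and the labels `"less"`, `"greater"` and `"two-sided"` are illustrative choices.

```python
from statistics import NormalDist


def rejects(z: float, alpha: float, alternative: str) -> bool:
    """Decide whether a standard normal test statistic z falls in the critical region."""
    q = NormalDist().inv_cdf  # quantile function of the standard normal distribution
    if alternative == "less":        # H1: theta < theta0
        return z <= q(alpha)
    if alternative == "greater":     # H1: theta > theta0
        return z >= q(1 - alpha)
    if alternative == "two-sided":   # H1: theta != theta0
        return abs(z) >= q(1 - alpha / 2)
    raise ValueError(f"unknown alternative: {alternative!r}")


for alternative in ("less", "greater", "two-sided"):
    print(alternative, rejects(1.8, 0.05, alternative))
```

With the same observed value and the same level, the one-sided test in the matching direction rejects while the two-sided test does not, because the two-sided test splits the level between both tails.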

Overview of tests

The most important tests can be characterized according to various criteria, for example by

  1. Intended use, for example testing parameters of a distribution or the distribution itself
  2. Number of samples
  3. Dependence or independence of the samples
  4. Requirements on the population(s)

Unless otherwise stated, all tests in the following overview assume that the observations are independent and identically distributed.

Tests for location parameters (mean, median)

Test | Test regarding | Requirement(s)
For one sample
One-sample t-test | Mean | Normal distribution in the population, or the distribution satisfies the central limit theorem (rule of thumb: sample size greater than 30); the variance of the population is unknown
One-sample Gaussian test | Mean | Normal distribution in the population, or the distribution satisfies the central limit theorem (rule of thumb: sample size greater than 30); the variance of the population is known
Sign test | Median | -
For two independent samples
Two-sample t-test | Means | Normal distribution in the populations, or the distributions satisfy the central limit theorem (rule of thumb: total sample size at least 50); the variances in the populations are unknown but equal
Welch test | Means | Normal distribution in the populations, or the distributions satisfy the central limit theorem (rule of thumb: total sample size at least 50); the variances in the populations are unknown and unequal
Two-sample Gaussian test | Means | Normal distribution in the populations, or the distributions satisfy the central limit theorem (rule of thumb: total sample size at least 50); the variances in the populations are known and equal
Wilcoxon-Mann-Whitney test | Means and medians | Distribution functions are shifted against each other
Median test | Medians | -
For two dependent samples
Two-sample t-test | Means | The difference of the observations is normally distributed or satisfies the central limit theorem (rule of thumb: sample sizes greater than 30); the variance of the difference is unknown
Two-sample Gaussian test | Means | The difference of the observations is normally distributed or satisfies the central limit theorem (rule of thumb: sample sizes greater than 30); the variance of the difference is known
Wilcoxon signed-rank test | Medians | The difference of the observations is symmetrically distributed
Sign test | Medians | -
For several independent samples
Analysis of variance | Means | Normally distributed populations; the variances in the populations are equal
Kruskal-Wallis test | Means and medians | Distribution functions are shifted against each other
Median test | Medians | -
For several dependent samples
Analysis of variance with repeated measurements | Mean | Normally distributed populations, sphericity
Friedman test | Location parameters | -
Quade test | Location parameters | -

Tests for dispersion

Test | Test regarding | Requirement(s)
For one sample
F test | Variance | Normally distributed population
For two independent samples
F test | Variances | Normally distributed populations
For two or more independent samples
Bartlett's χ² test | Variances | Normally distributed populations
Levene test | Variances | -
For a multivariate sample
Bartlett's test for sphericity | Covariance matrix | -

Tests for correlation and association parameters

Test | Test regarding | Requirement(s)
For two independent samples
Chi-square independence test | Independence | Populations are discretely distributed
Fisher's exact test | Independence | Populations are discretely distributed
Steiger's Z test | Bravais-Pearson correlation | Populations are bivariate normally distributed
For two dependent samples
McNemar test | Independence | Populations are dichotomous

Goodness-of-fit or distribution tests

Test | Test regarding | Requirement(s)
For one sample
Chi-square goodness-of-fit test | Specified distribution | Population is discrete
Anderson-Darling test | Specified distribution | Population is continuous
Kolmogorov-Smirnov test | Specified distribution | Population is continuous
Cramér-von Mises test | Specified distribution | Population is continuous
Jarque-Bera test | Normal distribution | Population is continuous
Lilliefors test | Normal distribution | Population is continuous
Shapiro-Wilk test | Normal distribution | Population is continuous
For two samples
Two-sample Kolmogorov-Smirnov test | Identical distributions | Populations are continuous
Two-sample Cramér-von Mises test | Identical distributions | Populations are continuous
For several samples
Chi-square homogeneity test | Identical distributions | Populations are discrete

Tests in regression and time series analysis

Test | Test regarding | Requirement(s)
Linear regression
Global F test | “Coefficient of determination” | Normally distributed residuals
t test | Regression coefficient | Normally distributed residuals
Goldfeld-Quandt test | Heteroscedasticity | Normally distributed residuals
Chow test | Structural break | Normally distributed residuals
Time series analysis
Durbin-Watson test | Autocorrelation | Normally distributed residuals, fixed regressors, only first-order autocorrelation permitted, no heteroscedasticity
Box-Pierce test | Autocorrelation | ?
Ljung-Box test | Autocorrelation | ?

Various tests

Test | Test regarding | Requirement(s)
Dichotomous population
Binomial test | Proportion | Population is dichotomous
Run test | Randomness | Population is dichotomous
Grubbs test | Largest or smallest value | Population is normally distributed
Walsh test | Largest or smallest value | For a significance level of 5% (10%), at least 220 (60) values are required
General tests of maximum likelihood theory
Likelihood-ratio test | Coefficient or models | -
Wald test | Coefficient or models | -
Score test | Coefficient or models | -


Special forms of these tests are:

Multiple test
If, for example, instead of one H-test with more than two independent random samples, several U-tests are carried out as individual tests, these individual tests are regarded as a multiple test. It should be noted that the probability of at least one type I error increases with the number of individual tests carried out one after another. This must be taken into account when making comparisons.
Sequential test
In a sequential test, the sample size is not specified in advance. Rather, during ongoing data collection, a check is made after each new observation as to whether a decision for or against the null hypothesis can be made on the basis of the data already collected (see sequential likelihood-ratio test).
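
The idea of deciding after each new observation can be sketched with Wald's sequential probability ratio test for yes/no data. The simple hypotheses `p0` and `p1`, the error probabilities `alpha` and `beta`, and the function name `sprt` are assumptions made for illustration; the stopping bounds are Wald's usual approximations.

```python
from math import log


def sprt(observations, p0=0.25, p1=0.5, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: p = p0 against H1: p = p1.

    Returns the decision and the number of observations used.
    """
    upper = log((1 - beta) / alpha)   # reject H0 at or above this bound
    lower = log(beta / (1 - alpha))   # keep H0 at or below this bound
    llr = 0.0                         # cumulative log-likelihood ratio
    n = 0
    for n, hit in enumerate(observations, start=1):
        llr += log(p1 / p0) if hit else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "keep H0", n
    return "continue sampling", n


print(sprt([1, 1, 1, 1, 1, 1]))   # ('reject H0', 5)
print(sprt([0] * 10))             # ('keep H0', 8)
print(sprt([1, 0]))               # ('continue sampling', 2)
```

The number of observations needed is itself random: clear-cut data stop the procedure early, while ambiguous data keep it running.
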
Test | Brief description
Test of whether a sample comes from a normal distribution
Non-parametric tests
Cochran's test / Cochran's Q | Test for equal distribution of several related dichotomous variables
Kendall's concordance coefficient / Kendall's W | Test for correlation of rankings
Friedman test | Test for equality of the location parameter with an unknown but identical distribution in the c-sample case with paired samples
Quade test | Test for equality of the location parameter with an unknown but identical distribution in the c-sample case with paired samples

  1. We consider $[1/4, 1]$ as the parameter range so that the null hypothesis and the alternative hypothesis together cover the entire parameter range. If a person deliberately named the wrong suit, one could also infer clairvoyant abilities, but we assume that the test person wants to achieve the highest possible number of hits.
  2. George G. Judge, R. Carter Hill, W. Griffiths, Helmut Lütkepohl, T. C. Lee: Introduction to the Theory and Practice of Econometrics. 2nd edition. John Wiley & Sons, New York / Chichester / Brisbane / Toronto / Singapore 1988, ISBN 0-471-62414-4, p. 93.
  3. Jürgen Bortz, Christof Schuster: Statistics for human and social scientists. 7th edition. Springer, Berlin 2010, ISBN 978-3-642-12769-4.
  4. Jürgen Bortz, Gustav A. Lienert, Klaus Boehnke: Distribution-free methods in biostatistics. 3rd edition. Springer, 2008, pp. 35-36.
  5. J. Hartung: Statistics: teaching and manual of applied statistics. 8th edition. Oldenbourg, 1991, p. 139.
  6. K. Bosch: Statistics Pocket Book. Oldenbourg, 1992, p. 669.


  • Joachim Hartung, Bärbel Elpelt, Karl-Heinz Klösener: Statistics. Teaching and handbook of applied statistics [with numerous calculated examples], 15th, revised and expanded edition. Oldenbourg, Munich 2005, ISBN 978-3-486-59028-9 .
  • Horst Rinne: Pocket book of statistics. 4th, completely revised and expanded edition. Harri Deutsch, Frankfurt am Main 2008, ISBN 978-3-8171-1827-4 .
