p-value

from Wikipedia, the free encyclopedia


The p-value (after R. A. Fisher), also called the exceedance probability or probability of significance (p from Latin probabilitas = probability), is, in statistics and especially in test theory, a measure of evidence for the credibility of the null hypothesis. It indicates the extent to which the observations support the null hypothesis, which usually states that a certain effect does not exist, e.g. that a new drug is not effective. Beyond its role as a measure of evidence, the p-value is used as a mathematical aid for deciding significance in hypothesis tests; in that use, no special meaning needs to be attached to the p-value itself.

The p-value is defined as the probability, under the condition that the null hypothesis is in fact true, of obtaining the observed value of the test statistic or a value that is more "extreme" in the direction of the alternative. The p-value thus corresponds to the lowest significance level at which the null hypothesis can just be rejected. Since the p-value is a probability, it takes values between zero and one, which makes different test results comparable. The specific value is determined by the sample drawn. If the p-value is "small" (less than a specified significance level, commonly 0.05), the null hypothesis can be rejected. Equivalently: if the calculated test statistic is greater than the critical value (which can be read directly from a quantile table), the null hypothesis can be rejected and one may assume that the alternative hypothesis holds, i.e. that a certain relationship exists (e.g. that a new drug is effective). If the null hypothesis is rejected in favor of the alternative hypothesis, the result is described as "statistically significant". "Significant" here simply means "unlikely to be due to chance alone" and is not synonymous with "practical relevance" or "scientific importance". In various scientific disciplines, fixed thresholds such as 5%, 1%, or 0.1% have become established for deciding whether or not the null hypothesis can be rejected. The size of the p-value provides no information about the size of the true effect.

The p-value is very often misinterpreted and misused, which is why the American Statistical Association felt compelled to publish a statement in 2016 on how to handle p-values and statistical significance. According to a small Canadian field study from 2019, the terms "p-value" and "statistical significance" are conveyed incorrectly in a number of textbooks. Studies by Oakes (1986) and Haller & Krauss (2002) show that many students and even teachers of statistics are unable to interpret the p-value correctly. Misuse and manipulation of p-values (see p-hacking and misuse of p-values) is an ongoing controversy in meta-research.

Mathematical formulation

In a statistical test, an assumption (the null hypothesis) is examined by carrying out a suitable random experiment that yields random variables X_1, ..., X_n. These random variables are combined into a single number, the test statistic

T = t(X_1, ..., X_n).

For a specific outcome (x_1, ..., x_n) of the experiment, a value

t = t(x_1, ..., x_n)

is obtained.

The p-value is defined as the probability, provided that the null hypothesis actually holds, of obtaining the observed value of the test statistic or a value that is "more extreme" in the direction of the alternative. For composite null hypotheses, this conditional probability can only be bounded from above.

For a right-sided test:

p = P(T >= t | H_0).

For a left-sided test:

p = P(T <= t | H_0).

And for a two-sided test:

p = 2 · min( P(T <= t | H_0), P(T >= t | H_0) ).

For a realization of the test statistic in the rejection region, the p-value is less than the significance level α; equivalently, the realization x of the test statistic is greater than the critical value z (for a right-sided test). The relevant density here is the probability density of the test statistic's distribution under the null hypothesis.
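The one- and two-sided definitions above can be illustrated with a short sketch. As a simplifying assumption for illustration only, the test statistic is taken to be standard normally distributed under the null hypothesis, so its CDF can be built from the error function:

```python
from math import erf, sqrt

def norm_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_value(t, alternative="two-sided"):
    """p-value for an observed statistic t that is N(0, 1) under H0."""
    if alternative == "right":
        return 1.0 - norm_cdf(t)       # P(T >= t | H0)
    if alternative == "left":
        return norm_cdf(t)             # P(T <= t | H0)
    # two-sided: 2 * min(P(T <= t), P(T >= t))
    return 2.0 * min(norm_cdf(t), 1.0 - norm_cdf(t))

print(p_value(1.96))           # two-sided, approximately 0.05
print(p_value(1.645, "right"))  # right-sided, approximately 0.05
```

For other null distributions, only the CDF in the sketch would need to be exchanged; the three definitions themselves stay the same.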

Usually a significance level α is set before the test, and the p-value is then compared with it. The smaller the p-value, the more reason there is to reject the null hypothesis. If the p-value is less than the specified significance level α, the null hypothesis is rejected; otherwise it cannot be rejected.

From the frequentist point of view, the p-value introduced by R. A. Fisher contains no further information; only whether it is less than a given level α is of interest. In this form, it is merely another way of stating that the observed value of the test statistic lies in the critical region, and it adds nothing new to the Neyman–Pearson theory of hypothesis testing.

Example

Suppose a coin is given. The null hypothesis to be tested is that the coin is fair, i.e. that heads and tails are equally likely; the alternative hypothesis is that one of the two outcomes is more likely, without specifying which. The random experiment for testing the null hypothesis consists of tossing the coin twenty times. Let X denote the number of tosses that come up "heads". With a fair coin, ten heads would be expected, so it is sensible to choose the test statistic

T = X.

Assume the experiment yields "heads" 14 times, so the realization is t = 14. Under the null hypothesis, the number of heads is binomially distributed with n = 20 and p = 0.5. The p-value for this outcome is therefore

p = 2 · min( P(X <= 14), P(X >= 14) ) = 2 · P(X >= 14) ≈ 0.115.

At a significance level of α = 5% = 0.05, the null hypothesis cannot be rejected, since 0.115 > 0.05. This means that one cannot infer from the data that the coin is unfair.

If instead the result were "heads" 15 times, the p-value for this outcome would be

p = 2 · P(X >= 15) ≈ 0.041.

In this case, at a significance level of α = 5% = 0.05, the null hypothesis would be rejected, since 0.041 < 0.05; one would conclude that the coin is not fair. At a significance level of 1%, however, the null hypothesis could not be rejected. (More precisely, one would consider the data insufficient to justify the conclusion that the coin is unfair. To take this as evidence that the coin is fair, however, would be wrong.)
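The coin example can be checked numerically. The following sketch computes the two-sided binomial p-values for 14 and 15 heads out of 20 tosses, using only the Python standard library:

```python
from math import comb

def binom_upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def two_sided_p(n, k, p=0.5):
    """Two-sided p-value 2 * min(P(X <= k), P(X >= k)), capped at 1."""
    lower = 1.0 - binom_upper_tail(n, k + 1, p)  # P(X <= k)
    upper = binom_upper_tail(n, k, p)            # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

print(round(two_sided_p(20, 14), 3))  # 0.115 -> cannot reject at alpha = 0.05
print(round(two_sided_p(20, 15), 3))  # 0.041 -> reject at alpha = 0.05
```

Note that this "doubled one-sided tail" convention is only one of several ways to define a two-sided p-value for a discrete distribution; other conventions sum the probabilities of all outcomes at most as likely as the observed one.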

Relationship to the level of significance

There is an equivalence between a test procedure based on calculating the p-value and one based on a significance level α determined in advance. The p-value is calculated from the observed value t of the test statistic, and the critical value c follows from the significance level, e.g. for a right-sided test:

p = P(T >= t | H_0)

and

α = P(T >= c | H_0),

where c represents the critical value. Statistical software reports the p-value when a test is carried out, e.g. as "Asymptotic Significance" in the output. If the p-value is less than the specified significance level α, the null hypothesis must be rejected.

(Figure: KS test for the "average house price per district" variable of the Boston Housing data set.)

On the one hand, reporting the p-value spares the software from asking for a significance level in order to make the test decision. On the other hand, there is the risk that the researcher adjusts the significance level, which is actually supposed to be determined in advance, in order to obtain the desired result.
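The equivalence of the two decision rules can be illustrated with a small sketch. As an assumption for illustration, a right-sided test with a standard normal test statistic is used, and the critical value is obtained by simple bisection rather than by a library quantile routine:

```python
from math import erf, sqrt

def norm_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_quantile(q, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (illustrative, moderate precision)."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha = 0.05
t = 2.0                       # hypothetical observed test statistic
c = norm_quantile(1 - alpha)  # critical value, approximately 1.645
p = 1.0 - norm_cdf(t)         # right-sided p-value, approximately 0.023

# Both decision rules agree: p < alpha exactly when t > c.
print(p < alpha, t > c)
```

Whichever realization t is plugged in, the two comparisons give the same decision, which is the equivalence stated above.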

Other properties

If the test statistic has a continuous distribution, then under a (point) null hypothesis the p-value is uniformly distributed on the interval [0, 1].
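This property can be checked by simulation. The following sketch assumes a right-sided test with a standard normal statistic under a true point null hypothesis; the resulting p-values should behave like Uniform(0, 1) draws:

```python
import random
from math import erf, sqrt

random.seed(0)

def norm_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Simulate many tests in which the null hypothesis is true: the statistic
# is N(0, 1), and the right-sided p-value is 1 - Phi(t).
ps = [1.0 - norm_cdf(random.gauss(0, 1)) for _ in range(100_000)]

# A Uniform(0, 1) variable has mean 0.5, and P(p <= alpha) = alpha,
# i.e. a true null is falsely rejected at level alpha with probability alpha.
print(round(sum(ps) / len(ps), 2))                     # close to 0.5
print(round(sum(p <= 0.05 for p in ps) / len(ps), 2))  # close to 0.05
```

The second printed quantity is exactly the false-positive rate of the test at level α = 0.05 when the null hypothesis is true.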

Typical misinterpretations

If the null hypothesis is rejected in favor of the alternative hypothesis, the result is described as "statistically significant"; "significant" here simply means "unlikely to be due to chance alone". A common misunderstanding is to equate this with the false claim that the p-value indicates how likely the null hypothesis is given the observed sample. In fact, the p-value indicates how extreme the result is: the smaller the p-value, the more the result speaks against the null hypothesis.

Goodman lists twelve statements about p-values that are extremely common and yet false, including the following:

  • It is wrong that, if p = 0.05, the probability that the null hypothesis is true is only 5%.
  • It is wrong that a non-significant difference in a comparison of means between two groups shows that the means are equal.
  • It is likewise wrong that only a significant difference means the result is important in practice, for example in clinical application.

Criticism of the p-value

Critics of the p-value point out that the criterion used to decide on "statistical significance" rests on an arbitrary choice of the significance level (often set to 0.05), and that this criterion leads to an alarming number of false-positive tests. The proportion of "statistically significant" tests in which the null hypothesis is actually true can be considerably higher than the significance level; it depends on how many of the tested null hypotheses are false and on the power of the tests. The division of results into significant and non-significant can be highly misleading: for example, analyses of nearly identical data sets can yield p-values that differ greatly in significance. In medical research, the p-value initially represented a considerable improvement over previous approaches, but with the increasing complexity of published articles it has become important to uncover misinterpretations of the p-value. It has been pointed out that in research fields such as psychology, where studies typically have low statistical power, the use of significance tests can lead to elevated error rates. The use of significance tests as a basis for decisions has also been criticized because of widespread misunderstandings of the procedure: contrary to popular belief, the p-value does not indicate the probability that the null hypothesis is true or false. Furthermore, the significance threshold should not be set arbitrarily, but should take into account the consequences of a false-positive result.


References

  1. Lothar Sachs, Jürgen Hedderich: Applied Statistics: Collection of Methods with R. 8th, revised and expanded edition. Springer Spektrum, Berlin/Heidelberg 2018, ISBN 978-3-662-56657-2, p. 452.
  2. R. Wasserstein, N. Lazar: The ASA's Statement on p-Values: Context, Process, and Purpose. In: The American Statistician. Volume 70, No. 2, 2016, pp. 129-133, doi:10.1080/00031305.2016.1154108.
  3. S. Cassidy, R. Dimova, B. Giguère, J. Spence, D. Stanley: Failing Grade: 89% of Introduction-to-Psychology Textbooks That Define or Explain Statistical Significance Do So Incorrectly. In: Advances in Methods and Practices in Psychological Science. June 2019, doi:10.1177/2515245919858072.
  4. Ludwig Fahrmeir, Rita Künstler, Iris Pigeot, Gerhard Tutz: Statistics. The Way to Data Analysis. 8th, revised and expanded edition. Springer Spektrum, Berlin/Heidelberg 2016, ISBN 978-3-662-50371-3, p. 388.
  5. J. Besag, P. Clifford: Sequential Monte Carlo p-values. In: Biometrika. Volume 78, No. 2, 1991, pp. 301-304, doi:10.1093/biomet/78.2.301.
  6. Steven Goodman: A Dirty Dozen: Twelve P-Value Misconceptions. In: Seminars in Hematology. Volume 45, 2008, pp. 135-140.