# Statistical significance

The result of a statistical test is called statistically significant if the sample data deviate so strongly from a predefined assumption (the null hypothesis) that, according to a rule fixed in advance, this assumption is rejected.

For this purpose, a significance level (also known as the probability of error) is set in advance, in accordance with current practice. It indicates the probability that an exactly true null hypothesis (the "hypothesis to be nullified", i.e. the hypothesis that should be rejected on the basis of the study data) is rejected by mistake (type I error). Conversely, the lower the chosen significance level, the higher the probability of retaining a null hypothesis that may in fact be false (type II error).

The result of a significance test provides no information about the strength of effects, the relevance of results, or their transferability to other circumstances. The p-value, on which the judgment of statistical significance is based, is very often misinterpreted and misused, which is why the American Statistical Association felt compelled to publish a statement on dealing with statistical significance in 2016. According to a small Canadian field study from 2019, the term is not conveyed correctly in a number of textbooks.

## Basics

Statistical significance is checked by statistical tests, which must be chosen so that they match the data and the parameters to be tested with regard to the probability function. Only then can the p-value be correctly calculated from the probability distribution of the random variable: the probability of obtaining, by chance alone, a sample result like the one observed or a more extreme one. Its expected proportion among samples drawn repeatedly from the same population can be given as a value between 0 and 1. The p-value is calculated under the assumption that the so-called null hypothesis holds.
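This definition can be sketched numerically for a simple binomial situation, where the null distribution is known exactly. The scenario and numbers below are illustrative, not taken from the text:

```python
from math import comb

def binomial_p_value(k: int, n: int, p0: float) -> float:
    """One-sided p-value: probability of k or more successes in n trials,
    computed under the null hypothesis X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Illustrative example: 60 heads in 100 tosses of a coin assumed fair (H0: p0 = 0.5).
# The p-value is the chance of a result at least this extreme arising by chance alone.
print(binomial_p_value(60, 100, 0.5))  # roughly 0.028
```

The sum runs over all outcomes at least as extreme as the observed one, which is exactly the "like the observed one or more extreme" clause in the definition above.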

The p-value is used to judge whether a certain probability of error is exceeded. This probability of error, which can be fixed in advance, is the probability of rejecting the hypothesis "the observed differences arose purely by chance" (the null hypothesis) even though it is correct. Such an error is also called a type I error or α error.

When setting this critical threshold, it makes sense to consider the consequences of mistakenly concluding that an observed difference is merely random. If those consequences are serious, one will choose a lower level rather than a higher one: for example 1% rather than 5%, or even 0.1%, as the maximum permissible probability of error. This probability is called the significance level $\alpha$.

For $\alpha = 0.05$ this means: if the null hypothesis is correct, the probability of wrongly rejecting it (type I error) must not exceed 5%. Accordingly, the probability of correctly not rejecting a true null hypothesis is then at least $1 - \alpha = 0.95$.

The significance level, or probability of error, therefore only states the probability of a type I error, that is, of rejecting the null hypothesis although it is correct. It does not state the probability with which a hypothesis is correct. If a hypothesis is to be demonstrated to be correct directly, the probability of a type II error (accepting the hypothesis although it is false) becomes greater the lower the significance level is set. Example: an experiment actually follows the probability p = 1/4, but the hypothesis p = 1/5 is to be confirmed. The probability of finding the hypothesis correct, although it is wrong, is 93% at a significance level of 5% and 99% at a significance level of 1% after 25 trials. With 1000 trials it is still 3.6% at a significance level of 5% and 11.4% at a significance level of 1%. It is therefore better to demonstrate something by rejecting a null hypothesis.

Example: 25% of the students in a school use an internal school network. After a promotional campaign, a survey of 50 students finds that 38% of them use the network. One can now test the null hypothesis p = 0.25 and, if it is rejected at a significance level of 5% (or 1%), state with 95% (or 99%) confidence that the proportion of students using the network actually increased as a result of the campaign. However, it cannot be concluded that the rate has risen to 38%.
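The school-network example can be checked numerically: under the null hypothesis p = 0.25, how likely is it to observe at least 19 of 50 students (38%) using the network purely by chance? A minimal sketch using the exact binomial tail:

```python
from math import comb

def binomial_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0), i.e. the one-sided p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# 38% of 50 students = 19 students; null hypothesis: p = 0.25
p_value = binomial_tail(19, 50, 0.25)
alpha = 0.05
print(p_value, "reject H0" if p_value <= alpha else "retain H0")
```

With these numbers the null hypothesis p = 0.25 is rejected at the 5% level, matching the conclusion in the example; as stated there, this does not show that the true rate is 38%.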

If the statistical procedure shows that the observed difference is not statistically significant, no definitive conclusion can be drawn from this either. In that case, the probability $\beta = \operatorname{Pr}(H_0 \mid \overline{H}_0)$ of a type II error, that is, of considering a false null hypothesis to be correct, is usually not even known.

In more general terms, statistical significance describes the possible information content of an event or a measurement, expressed as a probability against the background of random variation. The smaller $\alpha$ is, the higher the informational quality of a significant result.

The decisive question for the qualitative assessment is: "What does the statistical significance depend on?"

First and foremost, the size of a sample, its representativeness and its variance should be mentioned here. Statistical significance is strongly influenced by the sample size. If only a small sample is examined instead of a larger one, it is more likely that its composition does not represent the population, and the differences arising from the random selection carry more weight. If the selected sample represents the population in its essential characteristics, it is called a representative sample. The variance, i.e. the spread of the values within the examined group, is also important for the informational quality.

## Exemplary questions

• A survey found that 55% of women lean towards party A, while 53% of men prefer party B. Is there really a difference in the political convictions of men and women, or was it merely by chance that many women supporting party A and men supporting party B were interviewed?
• With a new drug, the cure rate is higher than without the drug. Is the new drug really effective, or was it merely by chance that a particularly large number of patients were selected who would have recovered on their own?
• A certain disease occurs particularly often in the vicinity of a chemical plant. Is that a coincidence, or is there a connection?

## Probability of error and level of significance

In the examples above, one cannot be certain that chance did not influence the results. However, one can estimate how likely it is that the measured results would occur if only chance were at work. This random error is generally called a type I error (synonym: $\alpha$ error), and the probability of its occurrence, given that the null hypothesis is correct, is called the probability of error.

In a parametric model, the probabilities of the various false conclusions depend on the unknown distribution parameter $\vartheta$ and can be specified with the help of the power function of the test.

The upper limit for the probability of error, i.e. the value one is just willing to accept for the probability of a type I error, is called the significance level. In principle it can be chosen freely; a significance level of 5% is common. The establishment of this value is variously attributed to R. A. Fisher. In practice this criterion means that, on average, one out of 20 studies in which the null hypothesis is correct (e.g. a drug is actually ineffective) will conclude that it is false (e.g. wrongly claim that the drug increases the chances of recovery).

A heuristic motivation for the value 5% is as follows: a normally distributed random variable takes a value that deviates from its expected value by more than 1.96 times the standard deviation with a probability of at most 5%:

• a p-value of ≤ 5% is referred to (e.g. by Jürgen Bortz) as a significant result,
• a value of ≤ 1% (2.3 standard deviations) as a very significant result, and
• a value of ≤ 0.1% (3.1 standard deviations) as a highly significant result.
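These thresholds can be related to the normal distribution with a few lines of code. Note a convention detail: 1.96 is the conventional two-sided 5% threshold, while the 2.3 and 3.1 quoted above are approximately the one-sided thresholds for 1% and 0.1%; the sketch below uses the two-sided counterparts (2.58 and 3.29), computed with only the standard library:

```python
from math import erf, sqrt

def two_sided_tail(z: float) -> float:
    """P(|X - mu| > z * sigma) for a normally distributed X."""
    return 1 - erf(z / sqrt(2))

# Conventional two-sided thresholds:
print(round(two_sided_tail(1.96), 3))  # 0.05  -> "significant"
print(round(two_sided_tail(2.58), 3))  # 0.01  -> "very significant"
print(round(two_sided_tail(3.29), 3))  # 0.001 -> "highly significant"
```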

It is important to note that this classification is purely arbitrary; it must be adapted to the respective application and should be confirmed by replication. Furthermore, it is problematic with regard to publication bias and p-hacking. Since, with a threshold of 5%, an average of 5% of all investigations in which the null hypothesis is true nevertheless reject it, this criterion is generally not sufficient to substantiate new discoveries. For example, a much stricter criterion of 5 standard deviations (corresponding to a p-value of 1 in 3.5 million) was used to establish the existence of the Higgs boson.
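The 5-standard-deviation criterion mentioned for the Higgs boson can be verified directly from the normal distribution (one-sided tail, as used in particle physics):

```python
from math import erfc, sqrt

def one_sided_tail(z: float) -> float:
    """P(X > mu + z * sigma) for a normally distributed X (one-sided tail).
    erfc is used instead of 1 - erf to avoid floating-point cancellation."""
    return 0.5 * erfc(z / sqrt(2))

p = one_sided_tail(5.0)
print(p)                      # about 2.9e-7
print(round(1 / p / 1e6, 1))  # about 3.5, i.e. roughly 1 in 3.5 million
```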

The significance of a result runs opposite to the numerical value of the significance level: a low significance level corresponds to high significance, and vice versa.

In contrast to Fisher's view of significance as a yardstick for the truthfulness of a hypothesis, the classical, strict Neyman-Pearson test theory does not provide for any subsequent grading of the test result into degrees of significance. From this point of view there are no "very significant" or "highly significant" results; additional information (for example the p-value) would have to be reported separately.

Even with statistically significant statements, a critical review of the test design and implementation is always necessary. Scientific investigations only rarely satisfy, for example, the mathematical prerequisites for a meaningful statistical test. In many studies, the investigator's wish for a "significant" result (e.g. in the context of a doctoral thesis) weighs too heavily on how the study is carried out. Investigations that confirm the null hypothesis are generally (though, from a statistical point of view, incorrectly) regarded as uninteresting and superfluous. The study design is also crucial. Characteristics such as "randomized", "controlled" and "double-blind" can serve as indicators of the quality of a study (e.g. in a medical setting). Without them, statements about the effectiveness of therapies must be treated with extreme caution.

In the case of frequently conducted, less elaborate studies, there is also the risk that, for example, of twenty comparable studies only the one with a positive result is published, even though its significance was in fact achieved only by chance. This problem is the main cause of publication bias (see below). The interpretation of significant correlations in retrospective studies is particularly problematic. It should also be borne in mind that statistically significant correlations are often wrongly taken to imply an alleged causality (so-called spurious correlation).

## Interpretation problems

### Informative value and selectivity

Even in studies that are statistically significant, the practical information value can be low.

Studies with a large number of cases often yield highly significant results due to the high power (discriminating ability) of the test. Such studies can nevertheless have little informative value if the size of the observed effect or the measured parameter is not relevant. Statistical significance is therefore a necessary but not sufficient criterion for a statement that is also practically relevant. Effect size is an important tool for assessing relevance.
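One common effect-size measure is Cohen's d, the standardized mean difference between two groups. A minimal sketch (the sample data are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list, b: list) -> float:
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical measurements for a treatment and a control group:
treatment = [5.1, 4.9, 5.3, 5.0, 5.2]
control = [4.8, 4.7, 5.0, 4.9, 4.6]
print(round(cohens_d(treatment, control), 2))  # 1.9
```

Unlike the p-value, this number does not shrink as the sample grows; it describes how large the difference is, not how unlikely it is under the null hypothesis.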

Further critical touchstones from a methodological point of view are:

• the correctness of the statistical model assumptions (for example the distribution assumption )
• the number of statistical tests carried out (if several tests are performed and none of them is clearly identified as the primary test, the significance level should be adjusted)
• the prospective definition of the analytical methods, prior to the “unblinding” of double-blind studies
• the possible consequences that can arise from a type 1 or type 2 error, including possible dangers to health and life.
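The multiple-testing adjustment mentioned in the list above can be illustrated with the simple Bonferroni correction, one of several possible procedures (the p-values here are made up):

```python
def bonferroni_reject(p_values: list, alpha: float = 0.05) -> list:
    """Bonferroni correction: each of the m tests is compared against alpha / m,
    so that the family-wise error rate stays at most alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Five hypothetical p-values; only those below 0.05 / 5 = 0.01 remain significant:
print(bonferroni_reject([0.004, 0.03, 0.2, 0.008, 0.049]))
# [True, False, False, True, False]
```

Note that without the correction, three of the five tests would count as significant at the 5% level; with it, only two survive.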

### Erroneous assumptions

Contrary to a widespread opinion, significance should not be equated with the probability of error, even though in the output of some statistical programs (e.g. SPSS) the probability of error is misleadingly labeled "Sig." or "Significance". It is correct to speak of a "significant" result when the probability of error for the result obtained in a particular study is not above the previously defined significance level.

However, it is possible that a repetition of this study, with the same design and under otherwise identical conditions, would yield a result in the new sample for which the probability of error lies above the significance level. For randomly distributed variables, the probability of this case depends on the chosen significance level.

The word "significant" is often taken to mean "considerable". A statistically significant change, however, need not be large, only clearly detectable: it may well be a small change that has been measured unambiguously. With a sufficiently large number of measurements, every (real) effect will be measured as statistically significant, however small and unimportant it may be.
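The point that any real effect becomes significant with enough measurements can be sketched with a simple z-test (known standard deviation; the numbers are illustrative):

```python
from math import erfc, sqrt

def z_test_p(observed_shift: float, sigma: float, n: int) -> float:
    """Two-sided p-value for an observed mean shift, assuming known sigma.
    erfc avoids the cancellation of 1 - erf for large z."""
    z = abs(observed_shift) / (sigma / sqrt(n))
    return erfc(z / sqrt(2))

# A tiny shift of 0.01 standard deviations is irrelevant in practice,
# but becomes "significant" once the sample is large enough:
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p(0.01, 1.0, n))
```

At n = 100 the p-value is far above any usual threshold, while at n = 1,000,000 the same tiny shift is overwhelmingly "significant": the test detects it unambiguously, but its size has not changed.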

Furthermore, neither the chosen significance level nor the observed p-value determines:

• the effect size
• the probability that the null hypothesis is true or false
• the probability that the alternative hypothesis is true or false

### Scientific publishing

The presentation of statistically significant results affects whether a scientific article is published. This leads to what is known as "publication bias", since chance findings are not put into perspective by publication of the full range of investigations carried out. In addition, results selected for publication on the basis of significance usually have overestimated effect sizes. The reason is that, especially in smaller studies, only the largest differences or the strongest correlations reach significance.

### Significance and Causality

Significance says nothing about possible causal connections or their nature; this is often overlooked.

As an example: suppose a statistic showed that a certain disease occurred particularly frequently in the vicinity of a chemical plant, such that the difference from the usual distribution of this disease in the general population is significant. This statistically significant correlation would not necessarily mean that the chemical plant is causally responsible for the increased incidence of the disease.

(1) It would also be conceivable that the area around that chemical plant is an unpopular residential location, so that mainly financially weak families who cannot afford to move away live there. Financially weak families tend to eat less healthily and, as a rule, also have poorer health care than the population average; a number of diseases are favored by this, possibly including the very one in question.

(2) It is also conceivable that the disease occurs more frequently in some areas, for example when a certain population density is exceeded and the associated risk of infection increases, and that it is only by chance that the chemical plant stands in such an area with a higher incidence of this infectious disease.

In the first imagined case there could be a causal connection, but a different one from the one suggested by the statistical investigation. The causality could also run the other way: the chemical plant may have been built precisely where many socially disadvantaged families live (e.g. because, lacking a lobby, they were less able to resist the siting of a factory than the more affluent residents of other areas, or because their labor appeared cheaper when the location was chosen). To regard the chemical plant as the cause of the increased number of illnesses without further evidence would be a logical fallacy of the type "cum hoc ergo propter hoc".

In the second imagined case there would be no causal connection at all; instead, the so-called Texas sharpshooter (target) fallacy would be committed: after a significant accumulation of an event (here: the disease) has been observed, some other conspicuous event (here: the chemical plant) is picked out and interpreted as causally related to the first. Or, put even more simply: whatever is noticed as conspicuous somewhere will probably be related to something else conspicuous nearby, preferably causally and ad hoc ("cum ergo propter", here and now).

## Literature

• Erika Check Hayden: Weak statistical standards implicated in scientific irreproducibility. In: Nature, 2013, doi:10.1038/nature.2013.14131.
• David Salsburg: The Lady Tasting Tea. How Statistics Revolutionized Science in the Twentieth Century. Freeman, New York 2001, ISBN 0-7167-4106-7 (popular science).
• Ronald L. Wasserstein, Nicole A. Lazar: The ASA's Statement on p-Values: Context, Process, and Purpose. In: The American Statistician, Vol. 70, No. 2, 2016, pp. 129–133, doi:10.1080/00031305.2016.1154108.
• Valentin Amrhein, Fränzi Korner-Nievergelt, Tobias Roth: The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. In: PeerJ 5, 2017, e3544, doi:10.7717/peerj.3544.