Walsh outlier test

The Walsh outlier test is a statistical test that can be used to identify outliers in a sample. It does not assume a specific frequency distribution of the data and is therefore a non-parametric method. The test was developed by the American statistician John E. Walsh, who first described it in 1950.

Most other outlier tests assume a normal distribution and can therefore produce false positives in samples whose values are, for example, lognormally distributed. The Walsh outlier test does not have this problem. It does, however, require a sample size of more than 60 values for a significance level of α = 0.10 and more than 220 values for α = 0.05.

In addition, the number of assumed outliers must be specified a priori in order to carry out the test. The null hypothesis is that all observations belong to the sample and that the sample contains no outliers. The alternative hypothesis is that the highest or lowest values, as many as the number of outliers specified for the test, are in fact outliers.

Test execution

Null hypothesis	Alternative hypothesis
$H_0^{\min}$: The $r$ smallest values belong to a distribution.	$H_1^{\min}$: The $r$ smallest values do not belong to a distribution; they are outliers.
$H_0^{\max}$: The $r$ largest values belong to a distribution.	$H_1^{\max}$: The $r$ largest values do not belong to a distribution; they are outliers.

The following calculation steps are carried out:

$c = \lfloor \sqrt{2n} \rfloor$, where $\lfloor x \rfloor$ is the largest whole number less than or equal to $x$ (rounding down),
$k = c + r$,
$b = \sqrt{1/\alpha}$ and
$a = \frac{1 + b \sqrt{\frac{c - b^2}{c - 1}}}{c - b^2 - 1}$.
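The calculation steps above can be sketched in a few lines of Python. The function name `walsh_constants` and its argument names are illustrative choices, not part of the test's definition. The sketch follows the rounding-down rule for $c$ as given here and refuses to continue when $c - b^2 - 1$ is not positive.

```python
import math

def walsh_constants(n, r, alpha):
    """Return (c, k, b, a) for sample size n, r assumed outliers and level alpha."""
    c = math.isqrt(2 * n)      # floor(sqrt(2n)), computed exactly on integers
    k = c + r
    b2 = 1.0 / alpha           # b squared
    b = math.sqrt(b2)
    denom = c - b2 - 1
    if denom <= 0:
        raise ValueError("sample too small for this alpha: c - b^2 - 1 must be positive")
    a = (1 + b * math.sqrt((c - b2) / (c - 1))) / denom
    return c, k, b, a

c, k, b, a = walsh_constants(n=75, r=2, alpha=0.10)
print(c, k, round(b, 4), round(a, 3))   # 12 14 3.1623 2.348
```

Working with $b^2 = 1/\alpha$ directly, rather than squaring the rounded square root, avoids a tiny floating-point error in the denominator near the boundary case.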

If

$x_{(r)} - (1 + a) x_{(r+1)} + a x_{(k)} < 0$, then the null hypothesis $H_0^{\min}$ can be rejected at the significance level $\alpha$, or if
$x_{(n+1-r)} - (1 + a) x_{(n-r)} + a x_{(n+1-k)} > 0$, then the null hypothesis $H_0^{\max}$ can be rejected at the significance level $\alpha$.

The value $x_{(i)}$ denotes the $i$-th smallest observation of the sample; see also rank (statistics).
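As a minimal sketch of the decision rule, the following Python function sorts the sample, reads off the required order statistics and evaluates both inequalities. The name `walsh_test`, the helper `order_stat` and the return convention are assumptions made for illustration. The data in the usage lines are artificial and only serve to exercise the rule.

```python
import math

def walsh_test(data, r, alpha):
    """Return (reject_min, reject_max) for the r smallest and r largest values."""
    x = sorted(data)
    n = len(x)
    c = math.isqrt(2 * n)                  # floor(sqrt(2n))
    k = c + r
    b2 = 1.0 / alpha                       # b squared
    if c - b2 - 1 <= 0:
        raise ValueError("sample too small for this alpha")
    a = (1 + math.sqrt(b2) * math.sqrt((c - b2) / (c - 1))) / (c - b2 - 1)

    def order_stat(i):
        # i-th smallest observation, using the 1-based index of the text
        return x[i - 1]

    stat_min = order_stat(r) - (1 + a) * order_stat(r + 1) + a * order_stat(k)
    stat_max = (order_stat(n + 1 - r) - (1 + a) * order_stat(n - r)
                + a * order_stat(n + 1 - k))
    return stat_min < 0, stat_max > 0

# Artificial data: 75 evenly spaced values, then the same with two extreme highs.
evenly_spaced = list(range(1, 76))
with_two_highs = list(range(1, 74)) + [1000, 1001]
print(walsh_test(evenly_spaced, r=2, alpha=0.10))    # (False, False)
print(walsh_test(with_two_highs, r=2, alpha=0.10))   # (False, True)
```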

Since $a > 0$ is required, $\alpha > \frac{1}{\lfloor \sqrt{2n} \rfloor - 1}$ must hold. Therefore, at least 61 observations are required for a significance level of $\alpha = 10\,\%$, and at least 221 observations for a significance level of $\alpha = 5\,\%$.
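The sample-size condition can be checked by a direct search for the smallest $n$ that satisfies the inequality. The sketch below is illustrative only; the function name `min_sample_size` and the `round_up` switch are assumptions. The rounded-up variant of $\sqrt{2n}$ is not described in this article and is included purely for comparison. In this search, the counts 61 and 221 are obtained with rounding up, while rounding down yields 72 and 242.

```python
import math

def min_sample_size(alpha, round_up=False):
    """Smallest n with alpha > 1 / (c - 1), where c is sqrt(2n) rounded down or up."""
    n = 1
    while True:
        c = math.isqrt(2 * n)              # floor(sqrt(2n)), exact on integers
        if round_up and c * c < 2 * n:
            c += 1                         # ceil(sqrt(2n)); assumption for comparison
        if c > 1 and alpha > 1.0 / (c - 1):
            return n
        n += 1

for alpha in (0.10, 0.05):
    print(alpha, min_sample_size(alpha), min_sample_size(alpha, round_up=True))
# 0.1 72 61
# 0.05 242 221
```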

Example

If $n = 75$, $\alpha = 10\,\%$ and $r = 2$, then $c = 12$, $k = 14$, $b = 3.1623$ and $a = 2.348$. That is, if

$x_{(2)} - 3.348 x_{(3)} + 2.348 x_{(14)} < 0$, then $H_0^{\min}$ is rejected, or if
$x_{(74)} - 3.348 x_{(73)} + 2.348 x_{(62)} > 0$, then $H_0^{\max}$ is rejected.
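
The value of $a$ in this example can be verified by substituting $c = 12$ and $b^2 = 1/\alpha = 10$ into the formula for $a$:

$a = \frac{1 + 3.1623 \sqrt{\frac{12 - 10}{12 - 1}}}{12 - 10 - 1} = 1 + 3.1623 \cdot 0.4264 \approx 2.348$

The indices in the second inequality follow from $n + 1 - r = 74$, $n - r = 73$ and $n + 1 - k = 62$.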

Mathematical background

Walsh considers a linear combination of order statistics $X_{(i)}$ of the form

$L = X_{(r)} - (1 + a) X_{(j)} + a X_{(k)}$

with $1 < j < k$ and $a > 0$.

If the null hypothesis $H_0^{\min}$ holds, then $j = r + 1$ follows if $\operatorname{Var}(L)(1 + o(1))$ is to be minimal. If, in addition, $E(L) = K \sqrt{\operatorname{Var}(L)(1 + o(1))}$ holds, then the Chebyshev inequality gives:

$P(X_{(r)} - (1 + a) X_{(r+1)} + a X_{(k)} < 0) = P(L < 0) = P\left(\frac{L - E(L)}{\sqrt{\operatorname{Var}(L)}} < -K + o(1)\right) \leq \frac{1}{K^2} + o(1).$
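The Chebyshev bound can be illustrated with a small simulation. The sketch below rests on several assumptions: it reads $K$ as $b = \sqrt{1/\alpha}$, so that the bound $1/K^2$ equals $\alpha$; it draws samples without outliers from a standard normal distribution; and it uses a fixed seed and the hypothetical function name `false_rejection_rate`. It only checks that the observed rate of false rejections of $H_0^{\min}$ stays below $\alpha$. It is not a proof of the inequality.

```python
import math
import random

def false_rejection_rate(n=75, r=2, alpha=0.10, trials=2000, seed=0):
    """Share of outlier-free normal samples in which H0^min is rejected."""
    rng = random.Random(seed)
    c = math.isqrt(2 * n)                  # floor(sqrt(2n))
    k = c + r
    b2 = 1.0 / alpha                       # b squared, read here as K squared
    a = (1 + math.sqrt(b2) * math.sqrt((c - b2) / (c - 1))) / (c - b2 - 1)
    rejections = 0
    for _ in range(trials):
        x = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        # x[i - 1] is the i-th smallest observation x_(i)
        stat_min = x[r - 1] - (1 + a) * x[r] + a * x[k - 1]
        if stat_min < 0:
            rejections += 1
    return rejections / trials

rate = false_rejection_rate()
print(rate, rate <= 0.10)
```

Because the Chebyshev inequality is conservative, the simulated rate is expected to lie well below $\alpha$ rather than close to it.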

However, some not very restrictive requirements must be met:

If $Q(p)$ is the inverse distribution function of the population and $Q'(p)$ its first derivative, then for $r < s$ (possibly with $o(\sqrt{n})$) the following must hold under $H_0$:
- $E(X_{(r)}) = Q\left(\frac{s}{n+1}\right) - \frac{s - r}{n - 1} Q'\left(\frac{s}{n+1}\right)(1 + o(1))$,
- $\operatorname{Var}(X_{(r)}) = \frac{r}{(n+1)^2} \left(Q'\left(\frac{s}{n+1}\right)\right)^2 (1 + o(1))$,
- $\operatorname{Var}(X_{(s)}) = \frac{s}{(n+1)^2} \left(Q'\left(\frac{s}{n+1}\right)\right)^2 (1 + o(1))$,
- $\operatorname{Cov}(X_{(r)}, X_{(s)}) = \frac{r}{(n+1)^2} \left(Q'\left(\frac{s}{n+1}\right)\right)^2 (1 + o(1))$, and
- analogous conditions for $n + 1 - r$ and $n + 1 - s$.
For $\lfloor \sqrt{2n} \rfloor > K^2 + 1$, the $o(1)$ terms can be neglected, and it then follows that $k - r \leq \sqrt{2n}$.

Literature

John Edward Walsh: Some Nonparametric Tests of whether the Largest Observations of a Set are too Large or too Small. In: Annals of Mathematical Statistics. Vol. 21, no. 4, 1950, ISSN 0003-4851, pp. 583-592, doi:10.1214/aoms/1177729753.
John Edward Walsh: Correction to "Some Nonparametric Tests of Whether the Largest Observations of a Set Are Too Large or Too Small". In: Annals of Mathematical Statistics. Vol. 24, no. 1, 1953, pp. 134-135, doi:10.1214/aoms/1177729095.
John Edward Walsh: Large Sample Nonparametric Rejection of Outlying Observations. In: Annals of the Institute of Statistical Mathematics. 10/1958. The Institute of Statistical Mathematics, pp. 223-232, ISSN 0020-3157.
Large sample outlier detection. In: Douglas M. Hawkins: Identification of Outliers. Chapman & Hall, London and New York 1980, ISBN 0-41-221900-X, pp. 83/84.

Web links

Basics of statistics - Walsh outlier test Description of the test procedure