David-Hartley-Pearson test

The David-Hartley-Pearson test was developed in 1954 by the statisticians H. A. David, H. O. Hartley and E. S. Pearson. It is a statistical procedure for identifying outliers: it checks whether an observed extreme value (the smallest or the largest) plausibly belongs to a normally distributed population or is an outlier.

Requirements

To make statements about an extreme observation, the David-Hartley-Pearson test assumes that the underlying population is normally distributed; it is therefore a parametric test.

Hypotheses

The David-Hartley-Pearson test sets up the following pairs of hypotheses:

$H_{0}(1)\colon x_{(1)}$ is not an outlier vs. $H_{1}(1)\colon x_{(1)}$ is an outlier

$H_{0}(n)\colon x_{(n)}$ is not an outlier vs. $H_{1}(n)\colon x_{(n)}$ is an outlier

Here $x_{(1)}$ denotes the smallest and $x_{(n)}$ the largest observation of the sample.

Test statistic

The following test statistic is used to check the hypotheses $H_{0}(1)$ and $H_{0}(n)$:

$$T = \frac{R}{s} = \frac{x_{(n)} - x_{(1)}}{\sqrt{\frac{\sum_{i=1}^{n} (x_{(i)} - \overline{x})^{2}}{n-1}}},$$

that is, the range of the sample divided by its standard deviation.

The null hypothesis is rejected at the significance level $\alpha$ if

$$Q_{n;1-\alpha} < T.$$

Here $Q_{n;1-\alpha}$ denotes the critical value.
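The statistic and the rejection rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `dhp_statistic` and `reject_null` are chosen here for illustration, and the critical value is assumed to be supplied by the caller.

```python
import math


def dhp_statistic(sample):
    """Return T = R / s, the sample range divided by the sample standard deviation."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(sample) / n
    # Sample standard deviation with the n - 1 denominator, as in the formula for T.
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    if s == 0:
        raise ValueError("all observations are equal, so T is undefined")
    return (max(sample) - min(sample)) / s


def reject_null(sample, critical_value):
    """Apply the rejection rule: reject if the critical value is smaller than T."""
    return critical_value < dhp_statistic(sample)
```

The order of the observations does not matter here, because only the minimum, the maximum and the standard deviation enter the statistic.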

If the null hypothesis is rejected, the extreme value that lies farthest from the sample mean is identified as an outlier. If the smallest and the largest value are equally far from the mean, both are considered outliers.
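The selection rule for the outlier can be sketched as follows. The function name `outlier_candidates` is hypothetical, and the tolerance-based comparison of the two distances is an assumption made here to cope with floating-point rounding. The result is only meaningful once the null hypothesis has been rejected.

```python
import math


def outlier_candidates(sample):
    """Return the extreme value(s) lying farthest from the sample mean."""
    mean = sum(sample) / len(sample)
    smallest, largest = min(sample), max(sample)
    dist_low = mean - smallest    # distance of the smallest value from the mean
    dist_high = largest - mean    # distance of the largest value from the mean
    # Equal distances: both extremes count as outliers.
    if math.isclose(dist_low, dist_high):
        return [smallest, largest]
    return [smallest] if dist_low > dist_high else [largest]
```

For instance, in the sample 1, 2, 3, 10 the mean is 4, so the largest value (distance 6) is selected rather than the smallest (distance 3).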

Critical values

Extensive tables with critical values for the David-Hartley-Pearson test can be found in David et al. (1954). A selection of these is shown in the following table:

$n$	$Q_{n;0.90}$	$Q_{n;0.95}$	$Q_{n;0.975}$	$Q_{n;0.99}$	$Q_{n;0.995}$	$n$	$Q_{n;0.90}$	$Q_{n;0.95}$	$Q_{n;0.975}$	$Q_{n;0.99}$	$Q_{n;0.995}$
3	1.997	1.999	2.000	2.000	2.000	17	4.15	4.31	4.44	4.59	4.69
4	2.409	2.429	2.439	2.445	2.447	18	4.21	4.38	4.51	4.66	4.77
5	2.712	2.753	2.782	2.803	2.813	19	4.27	4.43	4.57	4.73	4.84
6	2.949	3.012	3.056	3.095	3.115	20	4.32	4.49	4.63	4.79	4.91
7	3.143	3.222	3.282	3.338	3.369	30	4.70	4.89	5.06	5.25	5.39
8	3.308	3.399	3.471	3.543	3.585	40	4.96	5.15	5.34	5.54	5.69
9	3.449	3.552	3.634	3.720	3.772	50	5.15	5.35	5.54	5.77	5.91
10	3.57	3.69	3.78	3.88	3.94	60	5.29	5.50	5.70	5.93	6.09
11	3.68	3.80	3.91	4.02	4.08	80	5.51	5.73	5.93	6.18	6.35
12	3.78	3.91	4.01	4.14	4.21	100	5.68	5.90	6.11	6.36	6.54
13	3.87	4.00	4.11	4.25	4.33	150	5.96	6.18	6.39	6.64	6.84
14	3.95	4.09	4.21	4.34	4.44	200	6.15	6.38	6.59	6.85	7.03
15	4.02	4.17	4.29	4.43	4.53	500	6.72	6.94	7.15	7.42	7.60
16	4.09	4.24	4.37	4.51	4.62	1000	7.11	7.33	7.54	7.80	7.99
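In software, the table is typically stored as a lookup structure. The sketch below copies a few rows of the table for the 0.95 and 0.99 levels only; the names `CRITICAL_VALUES` and `critical_value` are illustrative, and the choice to raise an error for untabulated sample sizes, rather than interpolate, is an assumption made here.

```python
import math

# A few entries copied from the table: sample size n -> {level 1 - alpha: Q}.
CRITICAL_VALUES = {
    5: {0.95: 2.753, 0.99: 2.803},
    10: {0.95: 3.69, 0.99: 3.88},
    12: {0.95: 3.91, 0.99: 4.14},
    20: {0.95: 4.49, 0.99: 4.79},
    30: {0.95: 4.89, 0.99: 5.25},
}


def critical_value(n, alpha):
    """Look up Q_{n; 1 - alpha}; raise ValueError if it is not tabulated here."""
    row = CRITICAL_VALUES.get(n)
    if row is None:
        raise ValueError(f"no critical values stored for n = {n}")
    for level, q in row.items():
        # Compare with a tolerance, since 1 - alpha is a floating-point number.
        if math.isclose(level, 1 - alpha, abs_tol=1e-9):
            return q
    raise ValueError(f"no critical value stored for alpha = {alpha}")
```

Calling `critical_value(12, 0.01)` returns the table entry 4.14.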

Example

To illustrate the test, the following observed series of measurements (already sorted) is assumed:

Measurement	$x_{1}$	$x_{2}$	$x_{3}$	$x_{4}$	$x_{5}$	$x_{6}$	$x_{7}$	$x_{8}$	$x_{9}$	$x_{10}$	$x_{11}$	$x_{12}$
Measured value (speed in m/s)	36	37	39	39	40	40	41	41	41	42	44	46

For these data, the components of the test statistic are

$$R = x_{12} - x_{1} = 46 - 36 = 10$$

and

$$s = \sqrt{\frac{1}{11} \sum_{i=1}^{12} (x_{i} - \overline{x})^{2}} \approx 2.75,$$

so that

$$T = \frac{R}{s} = \frac{10}{2.75} \approx 3.64 < 4.14 = Q_{12;0.99}.$$

This means that the null hypothesis cannot be rejected, and neither the largest nor the smallest value is identified as an outlier (at the significance level $\alpha = 0.01$).
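The calculation can be checked with a short script. It is a sketch using Python's standard `statistics` module; the variable names are illustrative, and the critical value is entered by hand from the table. Without intermediate rounding it gives a standard deviation of about 2.747 and a test statistic of about 3.640, which is below 4.14.

```python
import statistics

# Measurements from the example (speed in m/s), already sorted.
speeds = [36, 37, 39, 39, 40, 40, 41, 41, 41, 42, 44, 46]

# Critical value for n = 12 at the 0.99 level, copied from the table.
Q_12_099 = 4.14

R = max(speeds) - min(speeds)    # range of the sample
s = statistics.stdev(speeds)     # sample standard deviation (n - 1 denominator)
T = R / s                        # test statistic

print(f"R = {R}, s = {s:.3f}, T = {T:.3f}")
print("reject H0" if Q_12_099 < T else "do not reject H0")
```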

References

H. A. David, H. O. Hartley, E. S. Pearson: The distribution of the ratio, in a single, normal sample, of range to standard deviation. In: Biometrika. Vol. 41, 1954, pp. 482-493, doi:10.1093/biomet/41.3-4.482, JSTOR 2332728.
J. Hartung: Statistik. Lehr- und Handbuch der angewandten Statistik. 13th edition. R. Oldenbourg Verlag, Munich/Vienna 2002.
