Measure of dispersion (statistics)

Measures of dispersion, also called dispersion measures (from Latin dispersio "dispersion", from dispergere "to spread, scatter") or scatter parameters, are a group of metrics in descriptive statistics that describe how widely the values of a sample or of a frequency distribution are spread around a suitable location parameter. The various calculation methods differ mainly in their sensitivity to outliers.

Requirements for a measure of dispersion

Let $x_1, \dots, x_n \in \mathbb{R}$ be a sample and $s \colon \mathbb{R}^n \rightarrow \mathbb{R}$ a function. $s$ is called a measure of dispersion if it generally meets the following requirements:

$s(x_1, \dots, x_n)$ is a nonnegative real number that is zero when all observations are the same, $x_1 = x_2 = \ldots = x_n = \overline{x}$ (there is no variability in the data), and increases as the data become more diverse. If at least two observed values differ from one another, then the data scatter among one another or around a mean value, which should also be reflected in the measure of dispersion.
Non-negativity is required of a measure of dispersion because what matters for dispersion is the extent of the deviations, not their direction. A measure of dispersion should therefore be larger the more the observed values differ from one another. Often the stricter requirement is imposed that a measure of dispersion must not decrease when an observed value is replaced by a new value.
$s$ is translation invariant, i.e. a shift of the zero point has no influence on the dispersion. The following must therefore hold: $s(x_1 + a, \dots, x_n + a) = s(x_1, \dots, x_n) \quad \forall a \in \mathbb{R}$
It is also desirable that the measure of dispersion be invariant to changes in scale.
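
As a rough illustration of these requirements, the following sketch checks non-negativity, the zero value for constant data, and translation invariance for one concrete candidate, the empirical standard deviation. The helper name `is_translation_invariant`, the tolerance, and the sample values are assumptions made for this example only.

```python
import statistics

def is_translation_invariant(measure, sample, shift, tol=1e-9):
    """Check s(x_1 + a, ..., x_n + a) == s(x_1, ..., x_n) up to rounding."""
    shifted = [x + shift for x in sample]
    return abs(measure(shifted) - measure(sample)) <= tol

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative data
constant = [3.0, 3.0, 3.0, 3.0]                      # no variability at all

spread = statistics.stdev(sample)             # positive for varying data
spread_constant = statistics.stdev(constant)  # zero for identical observations
invariant = is_translation_invariant(statistics.stdev, sample, shift=100.0)

print(spread, spread_constant, invariant)
```

A check on a single sample cannot prove that a function satisfies the requirements in general; it only shows what each requirement means in practice.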

Measures

Around the arithmetic mean

Sum of squares of deviations

The most intuitive measure of dispersion is the sum of squared deviations. It equals $(n-1)$ times the empirical variance $s_x^2$:

$$SQ_x := \sum_{i=1}^{n} (x_i - \overline{x})^2 = (n-1)\, s_x^2.$$

Empirical variance

One of the most important parameters of dispersion is the variance, which is defined in two slightly different variants. The origin of these differences and their use is explained in the main article. The versions are given as

$$\tilde{s}_x^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \overline{x}\right)^2$$

and

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \overline{x}\right)^2.$$

In each case, $\overline{x}$ denotes the arithmetic mean of the sample $(x_1, \dots, x_n)$.
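
The relationship between the sum of squared deviations and the two variants of the variance can be sketched as follows. The sample values and the helper name `sum_of_squares` are illustrative assumptions; `statistics.pvariance` and `statistics.variance` from the Python standard library use the divisors $n$ and $n-1$ respectively.

```python
import statistics

def sum_of_squares(sample):
    """Sum of squared deviations from the arithmetic mean (SQ_x)."""
    mean = statistics.mean(sample)
    return sum((x - mean) ** 2 for x in sample)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data, mean 5
n = len(sample)

sq = sum_of_squares(sample)        # 32.0
var_n = sq / n                     # variant with divisor n
var_n_minus_1 = sq / (n - 1)       # variant with divisor n - 1

print(sq, var_n, var_n_minus_1)
```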

Empirical standard deviation

The standard deviation is defined as the square root of the variance and is therefore available in two versions:

$$\tilde{s} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \overline{x}\right)^2}$$

and

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \overline{x}\right)^2}.$$

An essential difference from the empirical variance is that the empirical standard deviation has the same dimension, and thus the same units, as the sample values.
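
A minimal sketch of both versions, computed directly from the formulas and compared with the standard-library functions `statistics.pstdev` (divisor $n$) and `statistics.stdev` (divisor $n-1$). The data and the unit (centimetres) are assumptions chosen for illustration.

```python
import math
import statistics

lengths_cm = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data in cm
n = len(lengths_cm)
mean = statistics.mean(lengths_cm)
squared_deviations = sum((x - mean) ** 2 for x in lengths_cm)

s_tilde = math.sqrt(squared_deviations / n)        # divisor n, result in cm
s = math.sqrt(squared_deviations / (n - 1))        # divisor n - 1, result in cm

print(s_tilde, s)
```

Both results are in centimetres, whereas the corresponding variances would be in square centimetres.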

Coefficient of variation

The empirical coefficient of variation is formed as the quotient of the empirical standard deviation $s$ and the arithmetic mean $\overline{x}$:

$$v = \frac{s}{\overline{x}}, \quad \overline{x} > 0.$$

It is dimensionless and therefore independent of the units of measurement.
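
The independence from units can be illustrated by computing the coefficient of variation for the same measurements expressed in two different units. The function name `coefficient_of_variation` and the height data are hypothetical and serve only as an example.

```python
import statistics

def coefficient_of_variation(sample):
    """Empirical standard deviation divided by the arithmetic mean."""
    mean = statistics.mean(sample)
    if mean <= 0:
        raise ValueError("the coefficient of variation requires a positive mean")
    return statistics.stdev(sample) / mean

heights_m = [1.60, 1.70, 1.75, 1.80, 1.90]      # illustrative data in metres
heights_cm = [100 * h for h in heights_m]       # same data in centimetres

v_m = coefficient_of_variation(heights_m)
v_cm = coefficient_of_variation(heights_cm)

print(v_m, v_cm)  # agree up to floating-point rounding
```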

Mean absolute deviation

The mean absolute deviation $e$ of a random variable $X$ from its expected value $\mu = \operatorname{E}(X)$ is defined by

$$e := \operatorname{E}\left(\left|X - \mu\right|\right).$$

This makes it the first absolute central moment of the random variable $X$. In the case of a specific sample $x_1, \dots, x_n$ with arithmetic mean $\overline{x}$, it is calculated by

$$e = \frac{1}{n} \sum_{i=1}^{n} \left|x_i - \overline{x}\right|.$$

The mean absolute deviation is usually avoided in mathematical statistics in favor of the quadratic deviation, which is easier to treat analytically. The absolute value function used in the definition is not differentiable everywhere, which makes minimization problems more difficult.

By the inequality between the arithmetic and the quadratic mean, the mean absolute deviation is less than or equal to the standard deviation. Equality holds only when $|X - \mu|$ takes a single value with probability one, for example for constant random variables.
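
A small sketch of the sample formula, together with a numerical check that the mean absolute deviation does not exceed the standard deviation with divisor $n$. The helper name `mean_absolute_deviation` and the data are assumptions for this example.

```python
import statistics

def mean_absolute_deviation(sample):
    """Average absolute distance of the observations from their arithmetic mean."""
    mean = statistics.mean(sample)
    return sum(abs(x - mean) for x in sample) / len(sample)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data, mean 5

e = mean_absolute_deviation(sample)      # (3+1+1+1+0+0+2+4) / 8 = 1.5
s_tilde = statistics.pstdev(sample)      # standard deviation with divisor n

print(e, s_tilde)
```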

For symmetric distributions, i.e. distributions with the property $f(\mu - x) = f(\mu + x)$ for all real $x$, whose density is monotonically decreasing for $x > \mu$, the interquartile range $IQR$ satisfies

$$IQR \leq 2e.$$

Equality holds for the continuous uniform distribution.
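
The inequality can be checked for two symmetric distributions. The sketch below uses the closed form $e = \sqrt{2/\pi}\,\sigma$ for the normal distribution and, as an assumption not stated in the text, the closed forms $IQR = (b-a)/2$ and $e = (b-a)/4$ for the continuous uniform distribution on $[a, b]$.

```python
import math
from statistics import NormalDist

# Normal distribution: strict inequality IQR < 2e
sigma = 1.0
normal = NormalDist(mu=0.0, sigma=sigma)
iqr_normal = normal.inv_cdf(0.75) - normal.inv_cdf(0.25)
two_e_normal = 2 * math.sqrt(2 / math.pi) * sigma

# Continuous uniform distribution on [a, b]: equality IQR = 2e
a, b = 0.0, 1.0
iqr_uniform = (a + 0.75 * (b - a)) - (a + 0.25 * (b - a))
two_e_uniform = 2 * (b - a) / 4

print(iqr_normal, two_e_normal)
print(iqr_uniform, two_e_uniform)
```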

Around the median

Quantile distance

The quantile distance is the difference between the $(1-p)$-quantile and the $p$-quantile:

$$QA_p = Q_{1-p} - Q_p$$

with

$$0 \leq p < 0.5.$$

$100 \cdot (1-2p)$ percent of all measured values lie within $QA_p$.
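
A minimal sketch of the quantile distance for a sample. Several conventions exist for empirical quantiles; the helper `quantile` below uses linear interpolation between order statistics, which is an assumption of this example and not prescribed by the text.

```python
def quantile(sample, p):
    """p-quantile by linear interpolation between order statistics."""
    ordered = sorted(sample)
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def quantile_distance(sample, p):
    """QA_p = Q_(1-p) - Q_p for 0 <= p < 0.5."""
    if not 0 <= p < 0.5:
        raise ValueError("p must satisfy 0 <= p < 0.5")
    return quantile(sample, 1 - p) - quantile(sample, p)

sample = list(range(1, 102))               # illustrative data: 1, 2, ..., 101

qa_10 = quantile_distance(sample, 0.10)    # spans the middle 80 percent
qa_25 = quantile_distance(sample, 0.25)    # spans the middle 50 percent
qa_0 = quantile_distance(sample, 0.0)      # largest minus smallest value

print(qa_10, qa_25, qa_0)
```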

Interquartile range

The interquartile range, abbreviated IQR, is calculated as the difference between the quartiles $Q_{0.75}$ and $Q_{0.25}$:

$$IQR = Q_{0.75} - Q_{0.25}$$

50% of all measured values lie within the IQR. Like the median $Q_{0.5}$, it is insensitive to outliers. It can be shown that it has a breakdown point of $\epsilon^* = 0.25$.

The interquartile range is equal to the quantile distance $QA_{0.25}$.
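
The insensitivity to outliers can be illustrated by replacing one observation with an extreme value. The sketch uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions; the data and the helper names are assumptions for this example. The difference between the largest and smallest value is shown for contrast.

```python
import statistics

def interquartile_range(sample):
    """Q_0.75 - Q_0.25 using the 'inclusive' quantile convention."""
    q1, _median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return q3 - q1

def value_range(sample):
    """Largest minus smallest observation."""
    return max(sample) - min(sample)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 1000]   # one extreme value

iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)
range_clean = value_range(clean)
range_outlier = value_range(with_outlier)

print(iqr_clean, iqr_outlier)        # unchanged by the outlier
print(range_clean, range_outlier)    # strongly affected by the outlier
```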

Mean absolute deviation from median

The mean absolute deviation from the median $\tilde{x}$ (also mean deviation from the median, abbreviated MD) is defined by

$$\operatorname{MD} = \operatorname{E}\left(\left|X - \tilde{x}\right|\right)$$

In the case of a specific sample, it is calculated by

$$\operatorname{MD} = \frac{1}{n} \sum_{i=1}^{n} \left|x_i - \tilde{x}\right|$$

Because the median minimizes the mean absolute deviation (the extremal property of the median), comparison with the mean absolute deviation $e$ from the mean always gives

$$\operatorname{MD} \leq e,$$

i.e. the mean absolute deviation from the median is at most the mean absolute deviation from the mean, and therefore also at most the standard deviation.

For symmetric distributions, the median and the expected value agree, and thus so do $\operatorname{MD}$ and $e$.

The following applies to the normal distribution:

$$\operatorname{MD} = e = \sqrt{\frac{2}{\pi}} \cdot \sigma \approx 0.80 \cdot \sigma$$
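
For a skewed sample the median and the mean differ, and the two mean absolute deviations can be compared directly. The helper name `mean_abs_deviation_from` and the data are assumptions made for this sketch.

```python
import statistics

def mean_abs_deviation_from(sample, center):
    """Average absolute distance of the observations from a given center."""
    return sum(abs(x - center) for x in sample) / len(sample)

sample = [1.0, 2.0, 3.0, 4.0, 100.0]   # illustrative skewed data

md = mean_abs_deviation_from(sample, statistics.median(sample))  # center 3
e = mean_abs_deviation_from(sample, statistics.mean(sample))     # center 22

print(md, e)
```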

Median of the absolute deviations

The median absolute deviation (also MedMed), abbreviated MAD, is defined by

$$P\left(\left|X - \tilde{x}\right| \leq \operatorname{MAD}\right) = 0.5$$

In the case of a specific sample, it is calculated by

$$\operatorname{MAD} = \operatorname{median}\left(\left|x_i - \tilde{x}\right|\right)$$

In the case of normally distributed data, the definition results in the following relationship to the standard deviation:

$$\operatorname{MAD} = z_{0.75} \cdot \sigma$$

$z_{0.75}$ is the 0.75 quantile of the standard normal distribution and is approximately 0.6745.

The median absolute deviation is a robust measure of dispersion; for normally distributed data, $\operatorname{MAD} / z_{0.75}$ gives a robust estimate of the standard deviation. It can be shown to have a breakdown point of $\varepsilon^* = 0.5$.
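
A minimal sketch of the sample MAD and of the rescaling by $z_{0.75}$, which is obtained here from `statistics.NormalDist().inv_cdf`. The function name `median_absolute_deviation` and the data are illustrative assumptions, and the rescaled value is only meaningful as an estimate of $\sigma$ if the bulk of the data is roughly normally distributed.

```python
import statistics
from statistics import NormalDist

def median_absolute_deviation(sample):
    """Median of the absolute deviations from the sample median."""
    center = statistics.median(sample)
    return statistics.median([abs(x - center) for x in sample])

sample = [1.0, 2.0, 3.0, 4.0, 100.0]     # illustrative data with one outlier

mad = median_absolute_deviation(sample)  # deviations 2, 1, 0, 1, 97 -> median 1
z_075 = NormalDist().inv_cdf(0.75)       # 0.75 quantile of the standard normal
sigma_hat = mad / z_075                  # rescaled MAD

print(mad, z_075, sigma_hat)
```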

Further measures of dispersion

Range

The range $R$ is calculated as the difference between the largest and the smallest measured value:

$$R = x_{\max} - x_{\min}$$

Since the range is only calculated from the two extreme values, it is not robust against outliers.

Geometric standard deviation

The geometric standard deviation is a measure of the dispersion around the geometric mean.

Graphic forms of representation

References

1. Andreas Büchter, H.-W. Henn: Elementary Stochastics - An Introduction. 2nd edition. Springer, 2007, ISBN 978-3-540-45382-6, p. 83.
2. Hans Friedrich Eckey et al.: Statistics: Basics - Methods - Examples, p. 74. (1st edition 1992; 3rd edition 2002, ISBN 3409327010. The 4th edition 2005 and the 5th edition 2008 appeared under the title Descriptive Statistics: Basics - Methods - Examples.)

Literature

Günter Buttler, Norman Fickel (2002), “Introduction to Statistics”, Rowohlt Verlag
Jürgen Bortz (2005), Statistics: For human and social scientists (6th edition), Springer Verlag, Berlin
Bernd Rönz, Hans G. Strohe (1994), Lexicon Statistics , Gabler Verlag

