Median

In statistics, the median (also known as the central value) is a measure of central tendency and a location parameter. The median of the measured values in a list is the value that lies exactly “in the middle” when the values are sorted by size. For example, for the unordered list 4, 1, 37, 2, 1, the median is the measured value 2, which is the central value of the ordered list 1, 1, 2, 4, 37.

In general, a median divides a data set, sample, or distribution into two halves such that the values in one half are no greater than the median and the values in the other half are no smaller.

Description

The median divides a list of values into two parts. It can be determined as follows:

All values are sorted in ascending order.
If the number of values is odd, the middle value is the median.
If the number of values is even, the median is usually defined as the arithmetic mean of the two middle values, which are then called the lower and upper median.

An important property of the median is its robustness against outliers.

Example: Seven unsorted measured values 4, 1, 15, 2, 4, 5, 4 are sorted by size: 1, 2, 4, 4, 4, 5, 15. The median (which here is also the upper and the lower median) is the value in the middle, i.e. 4. If one of the 4s were replaced by 46 due to an error, the median would not change: 1, 2, 4, 4, 5, 15, 46. The arithmetic mean, however, would jump from 5 to 11.
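The example above can be reproduced with a short Python sketch. It uses the standard library's statistics module; the variable names values and corrupted are chosen here for illustration only.

```python
from statistics import mean, median

# The seven measured values from the example.
values = [4, 1, 15, 2, 4, 5, 4]

# Same list, with the first 4 replaced by the erroneous value 46.
corrupted = [46, 1, 15, 2, 4, 5, 4]

print(sorted(values), median(values), mean(values))
print(sorted(corrupted), median(corrupted), mean(corrupted))
```

The median is the same for both lists, while the arithmetic mean reacts strongly to the single outlier.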

Comparison with other measures of central tendency

Figure: Comparison of mode, median and “mean” (strictly speaking, the expected value) for two log-normal distributions with median 1.

The median is a special quantile, namely the ½ quantile. Other important measures of location are the arithmetic mean and the mode.

Compared to the arithmetic mean, often called the average, the median is more robust against outliers (extremely deviating values) and can also be applied to ordinally scaled variables. The term median (from Latin medianus, “in the middle”, “the middle one”) comes from geometry, where it also denotes a boundary between two halves of the same size.

Areas of application

Figure: A table of grades whose median is 3−. Slightly fewer than half of the results are worse; once the grade 3− is included, the half is just exceeded.

In contrast to the arithmetic mean, the median can also be used for ordinally scaled variables, such as grade levels, for which differences between values have no quantitative meaning. The median can also be used for interval- and ratio-scaled data, where it has both advantages and disadvantages compared with the arithmetic mean as a measure of location. The median cannot be used for nominally scaled variables whose categories have no natural ranking, such as country of birth. There the mode is the only measure of location that can be determined.

The median is used in statistics and probability theory in three different meanings:

in descriptive statistics, as a measure of location describing a concrete list of sample values;

in probability theory, as the median of a probability distribution or of a random variable. Here the median is an alternative to the expected value for specifying a “mean value”;

in mathematical statistics, as the median of a random sample, used for the robust estimation of unknown distributions.

Median of a sample

A value $m$ is a median of a sample if at least half of the sample elements are not greater than $m$ and at least half are not less than $m$.

If the observations are sorted by size, that is, if one passes to the rank-ordered sample, then for an odd number of observations the median is the value of the observation in the middle of this sequence. With an even number of observations there is no single middle element, but two. In this case the values of the two middle observations and all values in between (even though these may not have occurred in any observation) are medians of the sample, since the above condition holds for all of these values.
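The defining condition can be checked directly. The following Python sketch is only an illustration; the function name is_median is hypothetical and not taken from any library.

```python
def is_median(sample, m):
    """Return True if at least half of the sample is <= m
    and at least half of the sample is >= m."""
    n = len(sample)
    not_greater = sum(1 for x in sample if x <= m)
    not_less = sum(1 for x in sample if x >= m)
    # Compare 2 * count with n to avoid fractional halves.
    return 2 * not_greater >= n and 2 * not_less >= n


even_sample = [1, 2, 4, 5]
# Both middle observations and every value between them satisfy the condition.
print([m for m in (1.9, 2, 3, 4, 4.1) if is_median(even_sample, m)])
```

For the even-sized sample 1, 2, 4, 5, the values 2, 3 and 4 pass the check, while 1.9 and 4.1 do not.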

In the case of cardinally scaled measured values (where it makes sense to calculate differences between measured values), the arithmetic mean of the two middle observations is usually used when the number of observations is even. The median $\tilde{x}$ of an ordered sample $(x_1, x_2, \dotsc, x_n)$ of $n$ measured values is then

$$\tilde{x} = \begin{cases} x_{m+1} & \text{for odd } n = 2m+1 \\ \frac{1}{2}\left(x_{m} + x_{m+1}\right) & \text{for even } n = 2m \end{cases}$$

This definition has the advantage that, for samples from symmetric distributions, the arithmetic mean and the median have the same expected value.
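The case distinction above can be sketched in Python as follows. The function name sample_median is chosen for illustration; note that Python lists are indexed from 0, so the 1-based $x_{m+1}$ corresponds to xs[m].

```python
def sample_median(values):
    """Median of a sample: middle value for odd n,
    mean of the two middle values for even n."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    m = n // 2
    if n % 2 == 1:
        # n = 2m + 1: the 1-based element x_{m+1} is xs[m].
        return xs[m]
    # n = 2m: the 1-based elements x_m and x_{m+1} are xs[m-1] and xs[m].
    return (xs[m - 1] + xs[m]) / 2


print(sample_median([4, 1, 37, 2, 1]))
print(sample_median([1, 2, 4, 5]))
```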

Upper and lower median

Often one wants to ensure that the median is an element of the sample. In this case, as an alternative to the above definition, for an even number of elements $n = 2m$ either the lower median $\tilde{x}_{u} = x_{m}$ or the upper median $\tilde{x}_{o} = x_{m+1}$ is chosen as the median. For an odd number of observations $n = 2m+1$, the same applies as above: $\tilde{x} = \tilde{x}_{u} = \tilde{x}_{o} = x_{m+1}$.

Using Gaussian brackets (floor and ceiling), the indices can also be expressed compactly in terms of $n$ alone:

$$\tilde{x}_{u} = x_{\left\lfloor \frac{n+1}{2} \right\rfloor}$$

$$\tilde{x}_{o} = x_{\left\lceil \frac{n+1}{2} \right\rceil}$$
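The two index formulas above translate into the following Python sketch. The function names lower_median and upper_median are illustrative; the standard library functions statistics.median_low and statistics.median_high are printed alongside for comparison. Integer division is used so that $\lfloor (n+1)/2 \rfloor$ becomes (n + 1) // 2 and $\lceil (n+1)/2 \rceil$ becomes n // 2 + 1.

```python
from statistics import median_high, median_low


def lower_median(values):
    """Element x_{floor((n+1)/2)} of the sorted sample (1-based index)."""
    xs = sorted(values)
    n = len(xs)
    return xs[(n + 1) // 2 - 1]  # subtract 1 for 0-based list indexing


def upper_median(values):
    """Element x_{ceil((n+1)/2)} of the sorted sample (1-based index)."""
    xs = sorted(values)
    n = len(xs)
    return xs[n // 2 + 1 - 1]  # ceil((n+1)/2) equals n // 2 + 1


data = [1, 2, 4, 5]
print(lower_median(data), median_low(data))
print(upper_median(data), median_high(data))
```

Both results are always elements of the sample; for an odd number of values they coincide.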

This way of determining the median plays a major role in database systems, for example in SELECT queries that use the median of medians.

Properties

The median $\tilde{x}$, and in the case of an even number of measured values every value $\tilde{x}$ with $\tilde{x}_{u} \leq \tilde{x} \leq \tilde{x}_{o}$, minimizes the sum of the absolute deviations; that is, for any $x$,

$$\sum_{i=1}^{n} |\tilde{x} - x_{i}| \leq \sum_{i=1}^{n} |x - x_{i}|.$$

The median is the basis of the method of least absolute deviations and of robust regression. The arithmetic mean, on the other hand, minimizes the sum of squared deviations, is the basis of the least squares method and of regression analysis, and is mathematically easier to handle, but it is not robust against outliers.
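The two minimization properties can be checked numerically. The sketch below is only an illustration, not a proof: it searches a coarse grid of candidate values (an assumption made here for simplicity) and compares the minimizers with the median and the arithmetic mean. All names are chosen for this example.

```python
from statistics import mean, median


def sum_abs_dev(c, xs):
    """Sum of absolute deviations of the values xs from c."""
    return sum(abs(c - x) for x in xs)


def sum_sq_dev(c, xs):
    """Sum of squared deviations of the values xs from c."""
    return sum((c - x) ** 2 for x in xs)


xs = [1, 2, 4, 4, 5, 15, 46]

# Candidate centres 0.0, 0.1, ..., 50.0 (grid search, illustrative only).
candidates = [k / 10 for k in range(501)]

best_abs = min(candidates, key=lambda c: sum_abs_dev(c, xs))
best_sq = min(candidates, key=lambda c: sum_sq_dev(c, xs))

print(best_abs, median(xs))
print(best_sq, mean(xs))
```

On this grid the sum of absolute deviations is smallest at the median, and the sum of squared deviations is smallest at the arithmetic mean.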

As described above, the median can be determined algorithmically by sorting the measured values. In general this requires effort $\Omega(n \log n)$; effort $\mathcal{O}(n)$ is possible only for special classes of input data (see sorting algorithm). However, there are also algorithms for quantile determination with linear worst-case effort $\mathcal{O}(n)$, as well as algorithms for estimation, for example the Cornish-Fisher method.
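As an illustration of selection without full sorting, the following Python sketch uses quickselect with a random pivot. This is a swapped-in, simpler technique: it runs in linear time on average, not in the worst case; a worst-case linear algorithm would need a deterministic pivot rule such as the median of medians. The function names are chosen for this example.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting.
    Expected linear time; worst case is quadratic."""
    xs = list(values)
    if not 0 <= k < len(xs):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        n_equal = sum(1 for x in xs if x == pivot)
        if k < len(lows):
            xs = lows  # the target lies among the smaller values
        elif k < len(lows) + n_equal:
            return pivot  # the target equals the pivot
        else:
            k -= len(lows) + n_equal
            xs = [x for x in xs if x > pivot]  # target among larger values


def median_by_selection(values):
    """Median via selection: middle element, or mean of the two middle ones."""
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2


print(median_by_selection([4, 1, 15, 2, 4, 5, 4]))
```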

Median of grouped data

In the social sciences in particular, the median is often only estimated, since the data are not given explicitly and exactly but are only available grouped into intervals. For example, surveys rarely ask for the exact salary, but only for the income class, i.e. the range in which the salary lies. If only the frequencies of each class are known, the median of such a sample can generally only be determined approximately. Let $n$ be the total number of data points, $n_{i}$ the number of data points in the $i$-th group, and $u_{i}$ and $o_{i}$ the corresponding lower and upper interval limits. First, the median class (or median group) is determined, i.e. the group into which the median (according to the conventional definition above) falls, say the $m$-th group. The number $m$ is determined by the fact that $\textstyle \sum_{k=1}^{m-1} n_{k} < \frac{n}{2}$, but $\textstyle \sum_{k=1}^{m} n_{k} \geq \frac{n}{2}$ holds. If no further information about the distribution of the data is given, a uniform distribution within the class is postulated, for example, so that linear interpolation can be used to obtain an estimate of the median of the grouped data:

$$x_{\mathrm{med}} = u_{m} + \frac{\frac{n}{2} - \sum\limits_{k=1}^{m-1} n_{k}}{n_{m}} \cdot (o_{m} - u_{m})$$

If no further information about the distribution of the data is given, any distribution other than the uniform distribution may also be present, and thus any other value in the $m$-th interval can also be the median.

In contrast to the conventional definition of the median, this estimate does not necessarily have to be an element of the actual data set, which is usually not even known.

Example

Income:

Class ($i$)	Range ($u_{i}$ to $o_{i}$)	Group size ($n_{i}$)
1	at least 0, less than 1500	160
2	at least 1500, less than 2500	320
3	at least 2500, less than 3500	212

Calculate

$$\tfrac{n}{2} = \tfrac{212 + 320 + 160}{2} = \tfrac{692}{2} = 346.$$

So the median lies in the 2nd class (i.e., $m = 2$), since the first class has only 160 elements. This results in the following estimate for the median:

$$x_{\mathrm{med}} = 1500 + \tfrac{346 - 160}{320} \cdot (2500 - 1500) = 2081.25.$$

Since the concrete distribution of the data within the intervals is unknown, any other value in the 2nd interval can also be the median. The value 2081.25 calculated in the example can therefore be up to 581.25 too large or up to 418.75 too small, so the estimation error can be up to 28%.
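The interpolation used in this example can be sketched in Python as follows. The function name grouped_median and the representation of classes as (lower limit, upper limit, group size) tuples are assumptions made for this illustration; a uniform distribution within the median class is assumed, as in the text.

```python
def grouped_median(classes):
    """Estimate the median of grouped data by linear interpolation.

    classes: list of (lower, upper, count) tuples in ascending order.
    Assumes the data are uniformly distributed within the median class.
    """
    n = sum(count for _, _, count in classes)
    if n == 0:
        raise ValueError("no data")
    half = n / 2
    cumulative = 0  # number of data points in all classes before the current one
    for lower, upper, count in classes:
        if cumulative + count >= half:
            # Median class found: interpolate between its limits.
            return lower + (half - cumulative) * (upper - lower) / count
        cumulative += count
    raise ValueError("median class not found")


income_classes = [(0, 1500, 160), (1500, 2500, 320), (2500, 3500, 212)]
print(grouped_median(income_classes))
```

For the income table above, this reproduces the estimate 2081.25.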

This procedure for determining the median of grouped data can be illustrated graphically with the help of the cumulative curve. Here one looks for the abscissa value $x_{\mathrm{med}}$ that belongs to the ordinate value $\tfrac{n}{2}$. For smaller, even $n$, the ordinate value $\tfrac{n}{2} + 1$ can be chosen instead.

Other variants

The welfare function is an alternative to the median when determining mass income from a given income distribution.
Besides the median, another way of dealing with extreme values is the trimmed mean, which is obtained by removing the smallest and largest values before the calculation (typically 5% of the values are omitted).
Butler gives a stricter (and less common) definition of the median, according to which the median is the value for which the number of smaller values in the series equals the number of larger values in the series. For special cases such as 3, 3, 3, 3, 4 or 1, 2, 3, 3, 3, there is a procedure with which a unique median can be calculated while retaining the stricter definition.
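The trimmed mean mentioned in the list above can be sketched in Python as follows. The function name trimmed_mean is illustrative, and the sketch assumes that the given proportion is removed from each end of the sorted sample; the text does not specify whether the 5% refers to each end or to the total, so this is only one possible reading.

```python
from statistics import mean


def trimmed_mean(values, proportion=0.05):
    """Arithmetic mean after dropping the given proportion of the
    smallest and of the largest values (assumption: from each end)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * proportion)  # number of values dropped at each end
    return mean(xs[k:len(xs) - k])


data = [1, 2, 4, 4, 5, 15, 46]
print(mean(data), trimmed_mean(data, 0.2))
```

With a proportion of 0.2, one value is dropped at each end of the seven values, so the outlier 46 no longer affects the result.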

Median and arithmetic mean: very simple example

In a group of ten people, everyone has a different monthly income. One person receives € 1,000,000; the other nine receive € 1,000, € 2,000, € 3,000, and so on up to € 9,000.

The arithmetic mean (the "average", i.e. the monthly income each of the ten people would have if the sum of all incomes were divided evenly among them) is € 104,500 in this case. Only one of the ten people earns more than this; the other nine earn significantly less.

The median, on the other hand, is € 5,500. Five people earn more than that, five people less. The median marks the borderline between the higher-earning and the lower-earning half.
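The two figures can be verified with a short calculation. The symbol $\bar{x}$ for the arithmetic mean is notation introduced here; $\tilde{x}$ denotes the median as above. Since the number of incomes is even, the median is the mean of the fifth and sixth values of the sorted list:

$$\bar{x} = \frac{1000 + 2000 + \dotsb + 9000 + 1\,000\,000}{10} = \frac{45\,000 + 1\,000\,000}{10} = 104\,500, \qquad \tilde{x} = \frac{5000 + 6000}{2} = 5500.$$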

Web links

Detailed explanation of how to calculate the median by hand: Wikibooks
Exploitation of the robust properties of the median using the example of circular adjustment. (Memento of April 2, 2010 in the Internet Archive).
Eric W. Weisstein: Statistical Median. In: MathWorld (English).
A. V. Prokhorov: Median (in statistics). In: Michiel Hazewinkel (Ed.): Encyclopaedia of Mathematics. Springer-Verlag, Berlin 2002, ISBN 978-1-55608-010-4 (English, online).

References

Hans Lohninger: Basics of Statistics. Average.
Christopher Butler: Statistics in Linguistics. 1985.
Central tendency. Archived from the original on January 16, 2013; accessed on May 9, 2016.
