Empirical distribution function

An empirical distribution function, also known as the cumulative frequency function or distribution function of the sample, is a function in descriptive statistics and stochastics that assigns to each real number $x$ the proportion of sample values that are less than or equal to $x$. The empirical distribution function can be defined in different ways.

Definition

General definition

If $x_1, \ldots, x_n$ are the observed values in the sample, then the empirical distribution function is defined as

$$F_n(x) = \frac{\text{number of observed values in the sample} \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{x_i \leq x\}}$$

with $\mathbf{1}_{\{x_i \leq x\}} = 1$ if $x_i \leq x$ and zero otherwise, i.e. $\mathbf{1}_A$ here denotes the indicator function of the set $A$. The empirical distribution function thus corresponds to the distribution function of the empirical distribution.
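The indicator-sum definition translates almost literally into code. The following is a minimal sketch in Python; the function name `ecdf` and the toy sample are assumptions made for illustration and are not part of the original text.

```python
from typing import Sequence


def ecdf(sample: Sequence[float], x: float) -> float:
    """Proportion of sample values that are less than or equal to x."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # Sum of the indicators 1{x_i <= x}, divided by n.
    return sum(1 for x_i in sample if x_i <= x) / n


# Hypothetical toy sample, chosen only for illustration.
toy_sample = [2.0, 1.0, 3.0, 2.0]
print(ecdf(toy_sample, 0.5))  # x lies below the smallest value
print(ecdf(toy_sample, 2.0))  # three of the four values are <= 2
print(ecdf(toy_sample, 3.0))  # all values are <= 3
```

Each evaluation scans the whole sample, which keeps the sketch close to the formula; for many evaluations, sorting the sample once would be the more economical design.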

Empirical distribution function for unclassified data.

Alternatively, the empirical distribution function can be defined with the distinct observed values $a_1 < \ldots < a_k$ and the associated relative frequencies $h_1, \dotsc, h_k$ in the sample:

$$F_n(x) := \begin{cases} 0, & \text{if } x < a_1, \\ \sum_{j=1}^{i} h_j, & \text{if } a_i \leq x < a_{i+1},~ i \in \{1, \ldots, k-1\}, \\ 1, & \text{if } a_k \leq x. \end{cases}$$

The function $F_n(x)$ is thus a monotonically increasing, right-continuous step function with jumps at the respective distinct values.

Definition for classified data

Empirical distribution function for classified data.

Sometimes data are only available in classified form, i.e. there are $J$ classes with lower class limits $x_j^u$, upper class limits $x_j^o$ and relative class frequencies $h_j$, $j = 1, \ldots, J$.

Then the distribution function is defined as

$$F_n(x) := \begin{cases} 0, & \text{if } x < x_1^u, \\ \sum_{j=1}^{i-1} h_j + \frac{x - x_i^u}{x_i^o - x_i^u} h_i, & \text{if } x_i^u \leq x < x_i^o,~ i \in \{1, \ldots, J\}, \\ 1, & \text{if } x_J^o \leq x. \end{cases}$$

At the upper and lower class limits, the definition agrees with the definition for unclassified data, but linear interpolation now takes place in the areas in between (see also cumulative frequency polygon), which assumes that the observations are uniformly distributed within the classes. Empirical distribution functions of classified data are continuous (like the distribution functions of continuous probability distributions, e.g. the normal distribution), but they are differentiable only between the class boundaries, where their slope corresponds to the height of the respective column of the underlying histogram.

It should be noted, however, that the interval limits of classified data are chosen so that the observed values lie between the interval limits and not (as in the case of unclassified data) on them. Depending on the choice of class limits, slightly different cumulative frequency polygons can therefore arise for one and the same data set.
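The piecewise linear definition for classified data can be sketched as follows. This is an illustrative Python sketch, not a reference implementation: the function name `ecdf_classified`, its argument layout and the three toy classes are assumptions, and the classes are assumed to be sorted and adjacent.

```python
from typing import Sequence


def ecdf_classified(
    lower: Sequence[float],
    upper: Sequence[float],
    rel_freq: Sequence[float],
    x: float,
) -> float:
    """Empirical distribution function for classified data.

    lower[j], upper[j] and rel_freq[j] describe class j. The classes are
    assumed to be sorted and adjacent (upper[j] == lower[j + 1]).
    """
    if x < lower[0]:
        return 0.0
    if x >= upper[-1]:
        return 1.0
    cumulative = 0.0  # sum of h_j over all classes lying completely below x
    for lo, hi, h in zip(lower, upper, rel_freq):
        if lo <= x < hi:
            # Linear interpolation inside the class that contains x.
            return cumulative + (x - lo) / (hi - lo) * h
        cumulative += h
    raise ValueError("classes must be adjacent and cover the whole range")


# Hypothetical classes [0, 10), [10, 20), [20, 40) with made-up frequencies.
lower = [0.0, 10.0, 20.0]
upper = [10.0, 20.0, 40.0]
rel_freq = [0.2, 0.5, 0.3]
print(ecdf_classified(lower, upper, rel_freq, 15.0))  # halfway through the second class
```

At a class limit the interpolation term vanishes, so the function returns the plain cumulative relative frequency there, in line with the statement that both definitions agree at the class limits.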

Examples

General case: unclassified data

The horse kick data from Ladislaus von Bortkewitsch serve as an example. In the period from 1875 to 1894, a total of 196 soldiers died from horse kicks in 14 cavalry regiments of the Prussian army:

Empirical distribution function of the unclassified horse kick data.

Year	75	76	77	78	79	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	$\sum$
Deaths	3	5	7	9	10	18	6	14	11	9	5	11	15	6	11	17	12	15	8	4	196

Tabulating the distinct values together with their absolute and relative frequencies gives

$x_i$	3	4	5	6	7	8	9	10	11	12	14	15	17	18
Number of years	1	1	2	2	1	1	2	1	3	1	1	2	1	1
$h_i$	0.05	0.05	0.10	0.10	0.05	0.05	0.10	0.05	0.15	0.05	0.05	0.10	0.05	0.05
$F_n(x_i)$	0.05	0.10	0.20	0.30	0.35	0.40	0.50	0.55	0.70	0.75	0.80	0.90	0.95	1.00

The last row contains the value of the distribution function at the point $x = x_i$. For example, at the point $x = 6.5$ the result is $F_n(6.5) = 0.3$.
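The frequency table can be reproduced from the annual counts with a few lines of Python. The sketch below is only an illustration: the variable names and the helper `ecdf` are assumptions, and cumulative counts are kept as integers so that each row value is a single division by $n$.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

# Annual deaths 1875-1894, copied from the first table of this example.
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9, 5, 11, 15, 6, 11, 17, 12, 15, 8, 4]
n = len(deaths)

counts = Counter(deaths)
values = sorted(counts)  # distinct values a_1 < ... < a_k
cum_counts = list(accumulate(counts[a] for a in values))


def ecdf(x: float) -> float:
    """Step function: share of years with at most x deaths."""
    k = bisect_right(values, x)  # number of distinct values <= x
    return cum_counts[k - 1] / n if k > 0 else 0.0


# One line per distinct value: x_i, number of years, h_i, F_n(x_i).
for a, c in zip(values, cum_counts):
    print(a, counts[a], counts[a] / n, c / n)

print(ecdf(6.5))  # evaluated between the distinct values 6 and 7
```

Because the function is constant between two neighbouring distinct values, `ecdf(6.5)` returns the same value as `ecdf(6)`.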

Classified data

Classifying the data gives the following table. The corresponding graph is shown in the section on the definition for classified data.

from $x_i^u$	2	4	6	8	10	12	14	16
to $x_i^o$	4	6	8	10	12	14	16	18
$h_i$	0.10	0.20	0.10	0.15	0.20	0.05	0.10	0.10
$F_n(x_i^o)$	0.10	0.30	0.40	0.55	0.75	0.80	0.90	1.00

The last row contains the value of the distribution function at the point $x = x_i^o$. At the point $x = 6.5$ the result is $F_n(6.5) = 0.3 + \tfrac{6.5 - 6}{8 - 6} \cdot 0.1 = 0.325$.
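The classification step itself can also be checked in code. The following Python sketch rests on one assumption that the text does not state explicitly: a value lying on a class limit is counted in the class that ends there, i.e. the classes are treated as $(x_i^u, x_i^o]$. With this convention the class frequencies in the table are reproduced. All names are illustrative.

```python
from itertools import accumulate

# Annual deaths 1875-1894 and the class limits from the tables of this example.
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9, 5, 11, 15, 6, 11, 17, 12, 15, 8, 4]
n = len(deaths)
lower = [2, 4, 6, 8, 10, 12, 14, 16]
upper = [4, 6, 8, 10, 12, 14, 16, 18]

# Assumption: classes are (lower, upper], so a value on a limit belongs
# to the class that ends at that limit.
class_counts = [sum(1 for d in deaths if lo < d <= hi) for lo, hi in zip(lower, upper)]
cum_counts = list(accumulate(class_counts))


def ecdf_classified(x: float) -> float:
    """Cumulative frequency polygon with linear interpolation inside a class."""
    if x < lower[0]:
        return 0.0
    if x >= upper[-1]:
        return 1.0
    for i, (lo, hi) in enumerate(zip(lower, upper)):
        if lo <= x < hi:
            below = cum_counts[i - 1] if i > 0 else 0
            return (below + (x - lo) / (hi - lo) * class_counts[i]) / n
    raise ValueError("x is not covered by the classes")


print([c / n for c in class_counts])  # relative class frequencies h_i
print([c / n for c in cum_counts])    # F_n at the upper class limits
print(ecdf_classified(6.5))           # interpolated inside the class from 6 to 8
```

Since the polygon is continuous, the choice between open and closed class ends affects only how the raw values are counted, not how the interpolated function is evaluated.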

Convergence properties

The strong law of large numbers ensures that the estimator $F_n(x)$ converges almost surely to the true distribution function $F(x)$ for any value $x$:

$$F_n(x) \xrightarrow{\text{a.s.}} F(x),$$

i.e. the estimator is consistent. The empirical distribution function $F_n(x)$ therefore converges pointwise to the true distribution function. Another, stronger result, the Glivenko-Cantelli theorem, states that this convergence is uniform:

$$\|F_n - F\|_\infty \equiv \sup_{x \in \mathbb{R}} \big| F_n(x) - F(x) \big| \xrightarrow{\text{a.s.}} 0.$$

This property is the mathematical reason why it makes sense at all to describe data with an empirical distribution function.
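The uniform convergence can be observed in a small simulation. The Python sketch below draws standard normal samples of increasing size and computes the supremum distance between the empirical and the true distribution function. The seed, the sample sizes and the function names are arbitrary assumptions, and a simulation illustrates the theorem but does not prove it.

```python
import math
import random
from typing import Sequence


def normal_cdf(x: float) -> float:
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample: Sequence[float]) -> float:
    """sup over x of |F_n(x) - F(x)| for the standard normal F.

    F_n is a step function and F is increasing, so it suffices to compare F
    with F_n at each ordered sample value and just before it.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - normal_cdf(x), normal_cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
for n in (50, 500, 5000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(sup_distance(sample), 4))
```

The printed distances should tend to shrink as the sample size grows, although individual runs fluctuate.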

Ogive

Ogive (distribution function) of a theoretical and an empirical distribution.

Ogive originally referred to the pointed arch, a stylistic element of Gothic architecture, and to the reinforced ribs in vaults. The term was first used in statistics for a distribution function by Francis Galton in 1875:

"When the objects are marshalled in the order of their magnitude along a level base at equal distances apart, a line drawn freely through the tops of the ordinates .. will form a curve of double curvature ... Such a curve is called, in the phraseology of architects, an 'ogive'. "

- Francis Galton: Statistics by intercomparison, with remarks on the Law of Frequency of Error. Philosophical Magazine 49, p. 35

The ordered (often grouped) values are plotted on the horizontal axis of the coordinate system, and the relative cumulative frequencies in percent on the vertical axis.

The graphic on the right shows the cumulative distribution function of a theoretical standard normal distribution. If the right part of the curve is mirrored at the point $x = 0$ (dashed red), the resulting figure looks like an ogive.

An empirical distribution function is shown below. For the graph, 50 random numbers were drawn from a standard normal distribution. The more random numbers are drawn, the closer the empirical distribution function comes to the theoretical distribution function.
