Empirical distribution function

from Wikipedia, the free encyclopedia

An empirical distribution function, also known as the cumulative frequency function or the distribution function of the sample, is a function in descriptive statistics and stochastics that assigns to each real number $x$ the proportion of sample values that are less than or equal to $x$. The empirical distribution function can be defined in different ways.

Definition

General definition

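If $x_1, \dots, x_n$ are the observed values in the sample, then the empirical distribution function is defined as

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{(-\infty, x]}(x_i), \qquad x \in \mathbb{R},$$

with $\mathbf{1}_{(-\infty, x]}(x_i) = 1$ if $x_i \le x$ and zero otherwise, i.e. $\mathbf{1}_{(-\infty, x]}$ here denotes the indicator function of the set $(-\infty, x]$. The empirical distribution function thus corresponds to the distribution function of the empirical distribution.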
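The general definition translates almost directly into code. The following Python sketch is illustrative only: the function name `ecdf` and the sample values are assumptions made for this example and do not come from the article.

```python
from typing import Sequence


def ecdf(sample: Sequence[float], x: float) -> float:
    """Proportion of sample values that are less than or equal to x."""
    if len(sample) == 0:
        raise ValueError("sample must not be empty")
    # Sum of the indicator values 1{x_i <= x}, divided by the sample size n.
    return sum(1 for value in sample if value <= x) / len(sample)


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 3.5, 3.5, 7.0]
print(ecdf(sample, 1.0))   # no value is <= 1.0
print(ecdf(sample, 3.5))   # three of the four values are <= 3.5
print(ecdf(sample, 10.0))  # all values are <= 10.0
```

Evaluating the function at a sample value includes that value in the count, which is what makes the resulting step function right-continuous.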

Empirical distribution function for unclassified data.

Alternatively, the empirical distribution function can be defined with the distinct values $a_1 < a_2 < \dots < a_k$ occurring in the sample and the associated relative frequencies $f(a_j)$:

$$F_n(x) = \sum_{j \,:\, a_j \le x} f(a_j).$$

The function is thus a monotonically non-decreasing, right-continuous step function with jumps at the distinct values $a_j$.

Definition for classified data

Empirical distribution function for classified data.

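Sometimes data are only available in classified form, i.e. $k$ classes are given with lower class limits $l_j$, upper class limits $u_j$ and relative class frequencies $f_j$, $j = 1, \dots, k$.

Then the distribution function is defined as

$$F(x) = \begin{cases} 0, & x < l_1, \\ \sum_{i=1}^{j-1} f_i + f_j \, \dfrac{x - l_j}{u_j - l_j}, & l_j \le x < u_j, \quad j = 1, \dots, k, \\ 1, & x \ge u_k. \end{cases}$$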

At the upper and lower class limits, the definition agrees with the definition for unclassified data, but the values in between are now obtained by linear interpolation (see also cumulative frequency polygon), which assumes that the observations are uniformly distributed within each class. Empirical distribution functions of classified data are continuous (as are the distribution functions of continuous probability distributions, e.g. the normal distribution), but they are differentiable only between the class limits, where their slope equals the height of the respective bar of the underlying histogram.
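The interpolation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `classified_ecdf`, its parameter names and the example classes are hypothetical, and the classes are assumed to be sorted and adjacent (each upper limit equals the next lower limit).

```python
from typing import Sequence


def classified_ecdf(
    lower: Sequence[float],
    upper: Sequence[float],
    rel_freq: Sequence[float],
    x: float,
) -> float:
    """Empirical distribution function for classified data, evaluated at x.

    Assumes sorted, adjacent classes whose relative frequencies sum to 1.
    """
    if x < lower[0]:
        return 0.0
    if x >= upper[-1]:
        return 1.0
    cumulative = 0.0
    for lo, hi, f in zip(lower, upper, rel_freq):
        if x < hi:
            # Linear interpolation inside the class [lo, hi).
            return cumulative + f * (x - lo) / (hi - lo)
        cumulative += f
    return 1.0  # not reached for adjacent classes; kept as a safe fallback


# Hypothetical classes [0, 10), [10, 20), [20, 30) with made-up frequencies.
lower = [0.0, 10.0, 20.0]
upper = [10.0, 20.0, 30.0]
rel_freq = [0.2, 0.5, 0.3]
print(classified_ecdf(lower, upper, rel_freq, 15.0))  # halfway through class 2
```

Halfway through the second class the function has accumulated the full frequency of the first class plus half the frequency of the second, which is the uniform-within-class assumption in numeric form.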

It should be noted, however, that the class limits of classified data are chosen so that the observed values lie between the class limits and not (as in the case of unclassified data) on them. As a result, depending on the choice of class limits, slightly different cumulative frequency polygons can arise for one and the same data set.

Examples

General case: unclassified data

The horse kick data of Ladislaus von Bortkewitsch serve as an example. In the period from 1875 to 1894, a total of 196 soldiers died from horse kicks in 14 cavalry regiments of the Prussian army:

Empirical distribution function of the unclassified horse kick data.

Year    75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94  Total
Deaths   3  5  7  9 10 18  6 14 11  9  5 11 15  6 11 17 12 15  8  4    196

Writing down the table of distinct values and their relative frequencies gives

Value $a_j$                     3     4     5     6     7     8     9    10    11    12    14    15    17    18
Number of years                 1     1     2     2     1     1     2     1     3     1     1     2     1     1
Relative frequency $f(a_j)$  0.05  0.05  0.10  0.10  0.05  0.05  0.10  0.05  0.15  0.05  0.05  0.10  0.05  0.05
Cumulative $F_n(a_j)$        0.05  0.10  0.20  0.30  0.35  0.40  0.50  0.55  0.70  0.75  0.80  0.90  0.95  1.00

The last row contains the value of the distribution function at the corresponding value $a_j$. For example, at $x = 10$ the table gives $F_n(10) = 0.55$.
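The rows of the frequency table can be recomputed from the yearly death counts. The following Python sketch shows one way to do this; the variable names are assumptions made for illustration, while the data are the twenty yearly values listed in the example.

```python
from collections import Counter
from itertools import accumulate

# Deaths per year, 1875 to 1894, as listed in the example.
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9, 5, 11, 15, 6, 11, 17, 12, 15, 8, 4]

n = len(deaths)
counts = Counter(deaths)
values = sorted(counts)                      # distinct values
abs_freq = [counts[v] for v in values]       # number of years per value
rel_freq = [c / n for c in abs_freq]         # relative frequencies
# Accumulate the integer counts first to avoid floating-point drift.
cum_freq = [c / n for c in accumulate(abs_freq)]

for v, h, f, F in zip(values, abs_freq, rel_freq, cum_freq):
    print(f"{v:>3} {h:>2} {f:.2f} {F:.2f}")
```

Each printed line holds a distinct value, the number of years in which it occurred, its relative frequency and the cumulative relative frequency, in the same order as the rows of the table.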

Classified data

If the data are classified, the following table results. The corresponding plot is shown in the definition for classified data.

From                         2     4     6     8    10    12    14    16
To                           4     6     8    10    12    14    16    18
Relative class frequency  0.10  0.20  0.10  0.15  0.20  0.05  0.10  0.10
Cumulative frequency      0.10  0.30  0.40  0.55  0.75  0.80  0.90  1.00

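The last row contains the value of the distribution function at the respective upper class limit. Between the class limits the value is obtained by linear interpolation; for example, at $x = 9$ the result is $F(9) = 0.40 + 0.15 \cdot \frac{9 - 8}{10 - 8} = 0.475$.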
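Values between the class limits follow from linear interpolation within the class. The Python sketch below evaluates the distribution function of the classified horse kick data at arbitrary points; the helper name `F` and the use of `bisect` to locate the class are choices made for this illustration, not part of the article.

```python
from bisect import bisect_right
from itertools import accumulate

# Class boundaries and relative class frequencies from the classified table.
bounds = [2, 4, 6, 8, 10, 12, 14, 16, 18]
rel_freq = [0.10, 0.20, 0.10, 0.15, 0.20, 0.05, 0.10, 0.10]
# Cumulative frequency at each boundary, starting with 0 at the lowest one.
cum = [0.0] + list(accumulate(rel_freq))


def F(x: float) -> float:
    """Linearly interpolated distribution function of the classified data."""
    if x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    j = bisect_right(bounds, x) - 1  # index of the class containing x
    lo, hi = bounds[j], bounds[j + 1]
    return cum[j] + rel_freq[j] * (x - lo) / (hi - lo)


print(round(F(9), 3))   # 0.475
print(round(F(10), 3))  # 0.55
```

At $x = 9$ the function lies halfway between the cumulative values 0.40 and 0.55 of the neighbouring class limits, and at a class limit such as $x = 10$ it reproduces the tabulated cumulative value.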

Convergence properties

The strong law of large numbers ensures that the estimator $F_n(x)$ converges almost surely to the true distribution function $F(x)$ for every value $x$:

$$F_n(x) \xrightarrow{\text{a.s.}} F(x) \quad \text{as } n \to \infty,$$

i.e. the estimator is consistent. This gives the pointwise convergence of the empirical distribution function to the true distribution function. A stronger result, the Glivenko-Cantelli theorem, states that this convergence is uniform:

$$\sup_{x \in \mathbb{R}} \left| F_n(x) - F(x) \right| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.$$

This property is the mathematical reason why it makes sense at all to describe data with an empirical distribution function.
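The uniform convergence can be observed in a small simulation. The Python sketch below is an illustration under assumptions not taken from the article: it draws independent standard normal samples with an arbitrarily chosen fixed seed, and the helper names `normal_cdf` and `sup_distance` are hypothetical.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample: list[float]) -> float:
    """sup over x of |F_n(x) - F(x)| for a standard normal F.

    F_n is a step function, so the supremum is attained at a jump:
    compare F with F_n just before and at each ordered sample value.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - normal_cdf(x), normal_cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
for n in (50, 500, 5000, 50000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(sup_distance(sample), 4))
```

The printed distances typically shrink as the sample size grows, roughly in proportion to $1/\sqrt{n}$, although individual runs fluctuate.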

Ogive

Ogive (distribution function) of a theoretical and an empirical distribution.

Ogive originally referred to the pointed arch, a stylistic element of Gothic architecture, and to the reinforced ribs in vaults. The term was first used in statistics for a distribution function by Francis Galton in 1875:

"When the objects are marshalled in the order of their magnitude along a level base at equal distances apart, a line drawn freely through the tops of the ordinates .. will form a curve of double curvature ... Such a curve is called, in the phraseology of architects, an 'ogive'. "

- Francis Galton: Statistics by intercomparison, with remarks on the Law of Frequency of Error. Philosophical Magazine 49, p. 35

The ordered (often grouped) values are plotted on the horizontal axis of the coordinate system, and the relative cumulative frequencies in percent on the vertical axis.

The figure shows the cumulative distribution function of a theoretical standard normal distribution. If the right part of the curve is mirrored (dashed red), the resulting figure looks like an ogive.

An empirical distribution function is shown below it. For the graph, 50 random numbers were drawn from a standard normal distribution. The more random numbers are drawn, the closer the empirical distribution function comes to the theoretical distribution function.

Literature

  • Horst Mayer: Descriptive Statistics. Munich - Vienna 1995
