Empirical distribution function
An empirical distribution function (also known as the cumulative frequency function or the distribution function of the sample) is a function in descriptive statistics and stochastics that assigns to each real number $x$ the proportion of sample values that are less than or equal to $x$. The empirical distribution function can be defined in different ways.
Definition
General definition
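If $x_1, \dots, x_n$ are the observed values in the sample, then the empirical distribution function is defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{(-\infty, x]}(x_i)$$
with $\mathbf{1}_{(-\infty, x]}(x_i) = 1$ if $x_i \le x$ and zero otherwise, i.e. $\mathbf{1}_{(-\infty, x]}$ denotes the indicator function of the set $(-\infty, x]$. The empirical distribution function thus corresponds to the distribution function of the empirical distribution.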
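Alternatively, the empirical distribution function can be defined with the characteristic values $a_1 < a_2 < \dots < a_k$ and the associated relative frequencies $f(a_1), \dots, f(a_k)$ in the sample:
$$F_n(x) = \begin{cases} 0 & \text{if } x < a_1, \\ \sum_{i=1}^{j} f(a_i) & \text{if } a_j \le x < a_{j+1},\ j = 1, \dots, k-1, \\ 1 & \text{if } x \ge a_k. \end{cases}$$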
The function is thus a monotonically increasing, right-continuous step function with jumps at the respective characteristic values.
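The general definition can be sketched in a few lines of Python. This is only an illustration: the helper name `make_ecdf` and the small sample are assumptions made for the example, not part of the article.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")

    def ecdf(x):
        # bisect_right returns the number of observations x_i with x_i <= x,
        # which is exactly the sum of the indicator functions in the definition.
        return bisect_right(xs, x) / n

    return ecdf


# Hypothetical sample, chosen only to show the step behaviour.
F = make_ecdf([2.0, 1.0, 2.0, 4.0])
print(F(0.5), F(1.0), F(2.0), F(3.9), F(4.0))  # prints: 0.0 0.25 0.75 0.75 1.0
```

The value at a jump point already includes the jump (for instance `F(2.0)` is 0.75, not 0.25), which reflects the right-continuity of the step function.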
Definition for classified data
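Sometimes data is only available in classified form, i.e. $k$ classes are given with lower class limits $l_j$, upper class limits $u_j$ and relative class frequencies $f_j$, $j = 1, \dots, k$.
Then the distribution function is defined as
$$F(x) = \begin{cases} 0 & \text{if } x < l_1, \\ \sum_{i=1}^{j-1} f_i + f_j \, \frac{x - l_j}{u_j - l_j} & \text{if } l_j \le x < u_j, \\ 1 & \text{if } x \ge u_k. \end{cases}$$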
At the upper and lower class limits, the definition agrees with the definition for unclassified data, but between the limits the function is obtained by linear interpolation (see also cumulative frequency polygon), which assumes that the observations are uniformly distributed within the classes. Empirical distribution functions of classified data are continuous (like the distribution functions of continuous probability distributions, e.g. the normal distribution), but they are differentiable only between the class limits, where their slope corresponds to the height of the respective column of the underlying histogram.
It should be noted, however, that the class limits of classified data are chosen so that the observed characteristic values lie between the class limits and not (as in the case of unclassified data) on them. As a result, depending on the choice of class limits, slightly different cumulative frequency polygons can arise for one and the same data set.
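The interpolation rule for classified data can be sketched as follows. The function name `classified_cdf`, its argument names and the three classes in the usage lines are assumptions made for illustration; the sketch also assumes contiguous classes whose relative frequencies sum to 1.

```python
def classified_cdf(lower, upper, rel_freq):
    """Return the piecewise linear distribution function for classified data.

    lower[j], upper[j] are the limits of class j (assumed contiguous),
    rel_freq[j] is the relative frequency of class j.
    """

    def cdf(x):
        if x < lower[0]:
            return 0.0
        cumulative = 0.0
        for lo, hi, f in zip(lower, upper, rel_freq):
            if x < hi:
                # Inside class j: the frequencies of all earlier classes plus
                # the linearly interpolated share of the current class.
                return cumulative + f * (x - lo) / (hi - lo)
            cumulative += f
        return 1.0  # x lies at or above the last upper class limit

    return cdf


# Hypothetical classes [0, 10), [10, 20), [20, 30) with made-up frequencies.
F = classified_cdf([0, 10, 20], [10, 20, 30], [0.2, 0.5, 0.3])
print(F(-1), F(10), F(15), F(30))
```

Within a class the slope is `f / (hi - lo)`, i.e. the height of the corresponding histogram column.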
Examples
General case: unclassified data
The horse kick data of Ladislaus von Bortkewitsch serve as an example. In the period from 1875 to 1894, a total of 196 soldiers died from horse kicks in 14 cavalry regiments of the Prussian army:
Year | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | Total |
Deaths | 3 | 5 | 7 | 9 | 10 | 18 | 6 | 14 | 11 | 9 | 5 | 11 | 15 | 6 | 11 | 17 | 12 | 15 | 8 | 4 | 196 |
Tabulating the distinct characteristic values (numbers of deaths per year) together with their absolute, relative and cumulative relative frequencies gives
Deaths | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 15 | 17 | 18 |
Years | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 | 1 | 1 |
Relative frequency | 0.05 | 0.05 | 0.10 | 0.10 | 0.05 | 0.05 | 0.10 | 0.05 | 0.15 | 0.05 | 0.05 | 0.10 | 0.05 | 0.05 |
Distribution function | 0.05 | 0.10 | 0.20 | 0.30 | 0.35 | 0.40 | 0.50 | 0.55 | 0.70 | 0.75 | 0.80 | 0.90 | 0.95 | 1.00 |
The last line contains the value of the distribution function at the respective characteristic value. For example, at the point $x = 10$ the table gives the value $0.55$.
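The frequency table can be recomputed from the yearly death counts with a short script. This is a sketch using only the Python standard library; the variable names are chosen for illustration.

```python
from collections import Counter
from itertools import accumulate

# Yearly deaths 1875 to 1894, taken from the table above.
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9,
          5, 11, 15, 6, 11, 17, 12, 15, 8, 4]
n = len(deaths)

counts = Counter(deaths)
values = sorted(counts)                      # distinct characteristic values
abs_freq = [counts[v] for v in values]       # number of years per value
cum_counts = list(accumulate(abs_freq))      # cumulative number of years

# Cumulate the integer counts first and divide afterwards, so that no
# floating-point error builds up in the cumulative relative frequencies.
for v, c, cc in zip(values, abs_freq, cum_counts):
    print(f"{v:>2} | {c} | {c / n:.2f} | {cc / n:.2f}")
```

Each printed row holds a characteristic value, its number of years, its relative frequency and the value of the empirical distribution function at that point.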
Classified data
Classifying the data gives the following table. The corresponding graphic can be found in the definition section.
From | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
To | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 |
Relative class frequency | 0.10 | 0.20 | 0.10 | 0.15 | 0.20 | 0.05 | 0.10 | 0.10 |
Distribution function at the upper class limit | 0.10 | 0.30 | 0.40 | 0.55 | 0.75 | 0.80 | 0.90 | 1.00 |
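The last line contains the value of the distribution function at the upper class limit. For example, at the point $x = 7$ linear interpolation within the class from 6 to 8 gives $0.30 + 0.10 \cdot \frac{7 - 6}{8 - 6} = 0.35$.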
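The classified table can be reproduced from the raw data in a few lines. One assumption is needed that the text does not state: the class frequencies in the table are matched when each class is read as the interval (from, to], i.e. open on the left and closed on the right. The names `edges`, `class_counts` and `classified_cdf` are illustrative.

```python
# Yearly deaths 1875 to 1894, as in the unclassified example.
deaths = [3, 5, 7, 9, 10, 18, 6, 14, 11, 9,
          5, 11, 15, 6, 11, 17, 12, 15, 8, 4]
n = len(deaths)

edges = list(range(2, 20, 2))  # class limits 2, 4, ..., 18

# Assumption: a class (lo, hi] contains the observations with lo < d <= hi.
class_counts = [sum(lo < d <= hi for d in deaths)
                for lo, hi in zip(edges, edges[1:])]


def classified_cdf(x):
    """Distribution function of the classified data (linear interpolation)."""
    if x < edges[0]:
        return 0.0
    seen = 0  # observations in the classes entirely below x
    for lo, hi, c in zip(edges, edges[1:], class_counts):
        if x < hi:
            return (seen + c * (x - lo) / (hi - lo)) / n
        seen += c
    return 1.0


print([c / n for c in class_counts])           # relative class frequencies
print([classified_cdf(hi) for hi in edges[1:]])  # values at the upper limits
print(classified_cdf(7))                       # interpolated value inside a class
```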
Convergence properties
The strong law of large numbers ensures that the estimator $F_n(x)$ converges almost surely to the true distribution function $F(x)$ for every value $x$:
$$F_n(x) \xrightarrow{\text{a.s.}} F(x) \quad \text{as } n \to \infty,$$
i.e. the estimator is consistent. Thus the empirical distribution function $F_n$ converges pointwise to the true distribution function $F$. Another, stronger result, the Glivenko-Cantelli theorem, states that this convergence is uniform:
$$\sup_{x \in \mathbb{R}} \left| F_n(x) - F(x) \right| \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.$$
This property is the mathematical reason why it makes sense at all to describe data with an empirical distribution function.
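The uniform convergence can be illustrated with a small simulation. The following sketch draws standard normal samples of increasing size and computes the largest distance between the empirical and the true distribution function; the sample sizes, the seed and the function names are arbitrary choices for the example, and a single simulation run illustrates the theorem without proving it.

```python
import math
import random


def normal_cdf(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sup_distance(sample):
    """Largest distance between the empirical distribution function of
    `sample` and the standard normal distribution function."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = normal_cdf(x)
        # The empirical distribution function is a step function, so the
        # supremum is attained at an observation: compare the true value
        # with the step height just before (i / n) and at ((i + 1) / n) x.
        d = max(d, (i + 1) / n - fx, fx - i / n)
    return d


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
for n in (50, 500, 5000, 50000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(sup_distance(sample), 4))
```

The printed distances should tend to shrink as the sample size grows, although they need not decrease strictly from one line to the next.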
Ogive
Ogive originally referred to the pointed arch, an element of Gothic architecture, and to the reinforced ribs in vaults. The term was first used in statistics for a distribution function by Francis Galton in 1875:
"When the objects are marshalled in the order of their magnitude along a level base at equal distances apart, a line drawn freely through the tops of the ordinates .. will form a curve of double curvature ... Such a curve is called, in the phraseology of architects, an 'ogive'. "
The ordered (often grouped) characteristic values are plotted on the horizontal axis of the coordinate system, and the relative cumulative frequencies in percent on the vertical axis.
The graphic on the right shows the cumulative distribution function of a theoretical standard normal distribution. If the right part of the curve is mirrored at the point $x = 0$ (dashed red), the resulting figure looks like an ogive.
An empirical distribution function is shown below. For the graph, 50 random numbers were drawn from a standard normal distribution. The more random numbers you draw, the closer you get to the theoretical distribution function.