Kernel density estimator

from Wikipedia, the free encyclopedia

Kernel density estimation (KDE), also known as the Parzen window method, is a statistical method for estimating the probability distribution of a random variable.

Classical statistics assumes that statistical phenomena follow a certain probability distribution and that this distribution is realized in samples. Non-parametric statistics develops methods for identifying the underlying distribution from the realization of a sample. One well-known technique is to construct a histogram. Its disadvantage is that the resulting histogram is not continuous. In many cases, however, the underlying distribution can be assumed to have a continuous density function, as with the distribution of waiting times in a queue or of stock returns.

The kernel density estimators described below, by contrast, are methods that yield a continuous estimate of the unknown distribution. More precisely, a kernel density estimator is a uniformly consistent, continuous estimator of the density of an unknown probability measure by a sequence of densities.

Example

Figure: Kernel density estimation

In the following example, the density of a standard normal distribution (dashed black line) is estimated by kernel density estimation. In a real estimation problem this curve is of course unknown; it is the quantity to be estimated. A sample of size 100 was drawn from the standard normal distribution, and kernel density estimates were then computed with different bandwidths. The quality of the estimate clearly depends on the bandwidth chosen: a bandwidth that is too small produces a "jittery" estimate, while one that is too large produces an estimate that is too "coarse".
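The experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian kernel, the seed, the three bandwidth values and the helper name `integrated_squared_error` are chosen here and do not come from the article. The integrated squared error against the true standard normal density is used as a rough measure of quality.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(t, sample, h):
    """Kernel density estimate at t: sum of scaled kernels centred on the sample points."""
    n = len(sample)
    return sum(gaussian_kernel((t - x) / h) for x in sample) / (n * h)


def integrated_squared_error(sample, h, lo=-5.0, hi=5.0, steps=500):
    """Midpoint-rule approximation of the squared distance between the
    estimate and the true standard normal density on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * dx
        diff = kde(t, sample, h) - gaussian_kernel(t)
        total += diff * diff * dx
    return total


# Sample of size 100 from the standard normal distribution (seed is arbitrary).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Illustrative bandwidths: too small, moderate, too large.
ise_by_bandwidth = {h: integrated_squared_error(sample, h) for h in (0.05, 0.4, 2.0)}
for h, ise in ise_by_bandwidth.items():
    print(f"h = {h:4.2f}  integrated squared error = {ise:.4f}")
```

With these illustrative values, the moderate bandwidth should give a noticeably smaller error than either extreme, which matches the qualitative behaviour described above.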

Kernels

Figure: Kernel density estimation with Cauchy kernel

A kernel is the continuous Lebesgue density of an almost arbitrarily chosen probability measure. Possible kernels include:

  • Gaussian kernel
  • Cauchy kernel
  • Picard kernel
  • Epanechnikov kernel

These kernels are densities similar in shape to the Cauchy kernel shown. The kernel density estimator is a superposition, in the form of a sum, of suitably scaled kernels positioned according to the sample realization. The scaling and a prefactor ensure that the resulting sum is again the density of a probability measure. The figure is based on a sample of size 10, shown as black circles. Above the sample points, the Cauchy kernels are drawn (dashed green); their superposition yields the kernel density estimator (red curve).
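The following sketch writes out the four kernels in their usual standardized forms and builds the estimator as a sum of scaled kernels. The sample values, the bandwidth and the helper names (`scaled_kernel`, `integrate`) are hypothetical and chosen only for illustration. The numerical check at the end illustrates that the scaling by the bandwidth and the prefactor 1/n make the sum integrate to 1.

```python
import math


def gaussian(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cauchy(t):
    return 1.0 / (math.pi * (1.0 + t * t))


def picard(t):
    return 0.5 * math.exp(-abs(t))


def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def scaled_kernel(kernel, t, center, h, n):
    """One kernel centred on a sample point, scaled by bandwidth h and prefactor 1/n."""
    return kernel((t - center) / h) / (n * h)


def kde(kernel, t, sample, h):
    """Superposition (sum) of the scaled kernels of all sample points."""
    n = len(sample)
    return sum(scaled_kernel(kernel, t, x, h, n) for x in sample)


def integrate(func, lo, hi, steps=20000):
    """Midpoint rule; accurate enough for this illustration."""
    dx = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * dx) for i in range(steps)) * dx


# Hypothetical sample of size 10 and an arbitrary bandwidth.
sample = [-2.1, -1.3, -0.4, 0.0, 0.3, 0.8, 1.1, 1.9, 2.5, 3.2]
h = 0.5

# The estimate at a point is the sum of ten scaled Cauchy kernels.
print("Cauchy KDE at t = 0:", kde(cauchy, 0.0, sample, h))

# The Epanechnikov estimate has bounded support, so [-4, 5] captures all its mass.
mass = integrate(lambda t: kde(epanechnikov, t, sample, h), -4.0, 5.0)
print("total mass of the Epanechnikov KDE:", mass)
```

The Cauchy kernel has heavy tails, so a numerical integral over a finite interval stays slightly below 1; the bounded support of the Epanechnikov kernel makes it the convenient choice for the mass check.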

Among all kernels, the Epanechnikov kernel is the one that minimizes the mean squared error of the associated kernel density estimator.

The kernel density estimator

Definition

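If $x_1, \ldots, x_n \in \mathbb{R}$ is a sample and $k$ is a kernel, then the kernel density estimator $\tilde f_n$ for bandwidth $h > 0$ is defined as:

$$\tilde f_n(t) = \frac{1}{nh} \sum_{j=1}^{n} k\left(\frac{t - x_j}{h}\right).$$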

Theorem of Nadaraya

The choice of bandwidth is crucial for the quality of the approximation. If the bandwidth is chosen appropriately as a function of the sample size, the sequence of kernel density estimators converges almost surely uniformly to the density of the unknown probability measure. Nadaraya's theorem makes this statement precise. It says that, with an appropriately chosen bandwidth, an arbitrarily good estimate of the unknown distribution can be obtained by choosing a sufficiently large sample:

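Let $k$ be a kernel of bounded variation, and let the density $f$ of a probability measure be uniformly continuous. For $c > 0$ and $\alpha \in (0, 1/2)$, define the bandwidths $h_n = c / n^{\alpha}$ for $n \in \mathbb{N}$. Then the sequence of kernel density estimators $\tilde f_n$ converges uniformly to $f$ with probability 1, i.e.

$$P\left(\lim_{n \to \infty} \sup_{t \in \mathbb{R}} \left|\tilde f_n(t) - f(t)\right| = 0\right) = 1.$$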
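A small simulation can make the convergence statement tangible. The sketch below assumes bandwidths of the form h_n = c / n^alpha with 0 < alpha < 1/2, the form in which the theorem is usually stated; the values c = 1 and alpha = 0.25, the Gaussian kernel, the seed and the sample sizes are arbitrary choices made here. The supremum over the real line is only approximated by a maximum over a finite grid, so the output is an indication, not a proof.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density; a kernel of bounded variation."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def bandwidth(n, c=1.0, alpha=0.25):
    """Bandwidth sequence h_n = c / n**alpha (c and alpha are illustrative)."""
    return c / n ** alpha


def kde(t, sample, h):
    n = len(sample)
    return sum(gaussian_kernel((t - x) / h) for x in sample) / (n * h)


def sup_error(sample, h, lo=-4.0, hi=4.0, steps=160):
    """Largest absolute deviation from the true standard normal density,
    evaluated on a grid as a stand-in for the supremum over all t."""
    dx = (hi - lo) / steps
    return max(
        abs(kde(lo + i * dx, sample, h) - gaussian_kernel(lo + i * dx))
        for i in range(steps + 1)
    )


rng = random.Random(1)
sup_errors = {}
for n in (50, 500, 5000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    h_n = bandwidth(n)
    sup_errors[n] = sup_error(sample, h_n)
    print(f"n = {n:5d}  h_n = {h_n:.3f}  grid sup error = {sup_errors[n]:.4f}")
```

Because each run uses a single random sample, the error need not decrease monotonically from one sample size to the next; the theorem only asserts convergence in the limit.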

Application

Kernel density estimation has been used by statisticians since around 1950. Since the method was introduced into ecology in the 1990s, it has often been used there to describe the home range of an animal. The estimate can be used to calculate the probability that an animal is located in a given spatial area. Home-range predictions are represented by colored lines (e.g. isolines). The same approach underlies the "heat map" visualization of where team players (e.g. in football) are located during play, which has been customary since around 2010.
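The probability calculation mentioned above can be sketched as follows. The article only defines the one-dimensional estimator, so the two-dimensional version here is an assumption: a product of Gaussian kernels with one common bandwidth in both coordinates. The recorded locations, the bandwidth and the function names are hypothetical. With this kernel, the probability mass inside an axis-aligned rectangle has a closed form in terms of the normal distribution function, so no numerical integration is needed.

```python
import math


def normal_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_mass(a, b, center, h):
    """Mass that a Gaussian kernel centred at `center` with bandwidth h puts on [a, b]."""
    return normal_cdf((b - center) / h) - normal_cdf((a - center) / h)


def probability_in_rectangle(points, h, x_min, x_max, y_min, y_max):
    """Estimated probability of being inside the rectangle, under a
    product-Gaussian kernel density estimate of the recorded locations."""
    total = 0.0
    for x, y in points:
        total += interval_mass(x_min, x_max, x, h) * interval_mass(y_min, y_max, y, h)
    return total / len(points)


# Hypothetical recorded locations of one animal (coordinates in km).
fixes = [(1.0, 2.0), (1.4, 2.3), (0.8, 1.7), (1.1, 2.1), (3.0, 0.5), (1.2, 1.9)]

# Estimated probability that the animal is in the square [0.5, 1.5] x [1.5, 2.5].
p_core = probability_in_rectangle(fixes, 0.5, 0.5, 1.5, 1.5, 2.5)
print(f"estimated probability of being in the square: {p_core:.3f}")
```

Isolines of the kind used in home-range maps would instead require evaluating the estimated density on a grid and tracing its level sets, which is outside the scope of this sketch.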

References

  1. E. Parzen: On estimation of a probability density function and mode. In: Ann. Math. Stat., 33, 1962, pp. 1065–1076. doi:10.1214/aoms/1177704472
  2. É. Nadaraya: On Non-Parametric Estimates of Density Functions and Regression Curves. In: Theory of Probability & Its Applications, vol. 10, no. 1, 1965, ISSN 0040-585X, pp. 186–190. doi:10.1137/1110024 (siam.org, accessed June 24, 2016).
  3. Arthur R. Rodgers, John G. Kie: HRT: Home Range Tools for ArcGIS®. User's Manual. August 10, 2011, p. 6 ff. (lakeheadu.ca, PDF; accessed October 24, 2011).