Curse of dimensionality

A simple example of the curse of dimensionality: a centimeter consists of 10 millimeters, whereas a square centimeter consists of 100 square millimeters.

The curse of dimensionality is a term introduced by Richard Bellman to describe the rapid growth in volume that comes with adding further dimensions to a mathematical space.

Leo Breiman describes by way of example that 100 observations cover the one-dimensional space of real numbers between 0 and 1 well: a histogram can be computed from these observations and conclusions can be drawn. If, by comparison, 100 samples are collected in a 10-dimensional space of the same kind (each dimension taking values between 0 and 1), they are isolated points that do not cover the space well enough to make meaningful statements about it. To achieve coverage similar to that of the one-dimensional space, $100^{10} = 10^{20}$ samples would have to be drawn, which requires vastly greater effort.
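This sampling effect can be illustrated numerically. The following sketch (an illustration added here, not from the sources; function names and sample counts are arbitrary choices) estimates the average distance from a random query point to the nearest of 100 uniform samples, which grows quickly with the dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nearest_sample_distance(dim, n_samples=100, n_queries=1000):
    """Average distance from a random query point to its nearest sample."""
    samples = rng.random((n_samples, dim))
    queries = rng.random((n_queries, dim))
    # Distances between every query and every sample, shape (n_queries, n_samples).
    dists = np.linalg.norm(queries[:, None, :] - samples[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

for dim in (1, 2, 5, 10):
    print(f"dim={dim:2d}  mean distance to nearest of 100 samples: "
          f"{mean_nearest_sample_distance(dim):.3f}")
```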

Effect on distance functions

An often-quoted formulation of the "curse" states that, for various types of random distributions of the data, the difference between the smallest and the largest distance between data points becomes arbitrarily small relative to the smallest distance as the dimensionality $d$ increases (in other words, the smallest and largest distances differ only relatively little), so that the results of distance functions (and of algorithms based on them) are no longer usable for the affected distributions in high-dimensional spaces.[1] This can be formalized as

$$\lim_{d \to \infty} \frac{\operatorname{dist}_{\max} - \operatorname{dist}_{\min}}{\operatorname{dist}_{\min}} = 0.$$
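This concentration of distances can also be observed numerically. The following sketch (illustrative only; the point counts are arbitrary) estimates the relative contrast $(\operatorname{dist}_{\max} - \operatorname{dist}_{\min}) / \operatorname{dist}_{\min}$ for uniformly distributed points and shows how it shrinks as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_contrast(dim, n_points=1000):
    """(dist_max - dist_min) / dist_min for distances from the origin."""
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (1, 10, 100, 1000):
    print(f"dim={dim:4d}  relative contrast: {relative_contrast(dim):.3f}")
```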
However, more recent research suggests that it is not the sheer number of dimensions that matters, since additional relevant information can separate the data better. The effect is caused only by additional dimensions that are "irrelevant" to the separation.[2] While the exact distance values become more similar, the resulting order of distances remains stable; cluster analysis and outlier detection therefore remain feasible with suitable methods.[3]

Another way to characterize the "curse" is to compare a $d$-dimensional hypersphere of radius $r$ with a $d$-dimensional hypercube of side length $2r$. The volume of the hypersphere is given by

$$V_S(d) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} \, r^d,$$

where $\Gamma$ is the gamma function, and the volume of the hypercube is given by $V_C(d) = (2r)^d$. Considering the quotient of the two, it becomes apparent that the volume of the hypersphere becomes very small ("insignificant") compared to the volume of the hypercube as the dimension increases, because

$$\frac{V_S(d)}{V_C(d)} = \frac{\pi^{d/2}}{2^d \, \Gamma\left(\frac{d}{2}+1\right)} \to 0$$

for $d \to \infty$.

This convergence can be shown by the estimate

$$\frac{\pi^{d/2}}{2^d \, \Gamma\left(\frac{d}{2}+1\right)} \le \frac{\pi^{d/2}}{2^d \left(\frac{d}{2e}\right)^{d/2}} = \left(\frac{\pi e}{2d}\right)^{d/2} \to 0 \quad \text{for } d \to \infty,$$

where it is used that $\Gamma(x+1) \ge \left(\frac{x}{e}\right)^x$ for $x \ge 0$ and that $\frac{\pi e}{2d} < 1$ for $d \ge 5$.
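The decay of this quotient can be checked numerically; the following short sketch (added for illustration, using the gamma function from Python's standard library) prints the ratio for a few dimensions:

```python
import math

def sphere_to_cube_volume_ratio(d):
    """V_S(d) / V_C(d) for radius r = 1 (the ratio is independent of r)."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20, 50):
    print(f"d={d:3d}  V_S / V_C = {sphere_to_cube_volume_ratio(d):.3e}")
```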

Machine learning

The curse of dimensionality is a serious hurdle in machine learning problems that must learn the structure of a high-dimensional space from only a few sample elements.

Index structures

Many index structures are either distance-based or attempt to partition the space dimension by dimension. Both approaches usually run into problems with high-dimensional data, since either the distance values are no longer meaningful or the number of dimensions and the resulting partitioning options cause problems.
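As an illustration (a sketch added here, not a benchmark from the sources, and assuming SciPy is available), the effect can be observed with a k-d tree, a classic space-partitioning index: with a fixed data size, nearest-neighbor queries tend to slow toward brute-force speed as the dimension grows:

```python
import time
import numpy as np
from scipy.spatial import cKDTree  # assumes SciPy is installed

rng = np.random.default_rng(2)

def query_seconds(dim, n_points=20000, n_queries=500):
    """Wall-clock time for nearest-neighbor queries against a k-d tree."""
    tree = cKDTree(rng.random((n_points, dim)))
    queries = rng.random((n_queries, dim))
    start = time.perf_counter()
    tree.query(queries, k=1)
    return time.perf_counter() - start

for dim in (2, 10, 50):
    print(f"dim={dim:2d}  query time: {query_seconds(dim):.3f} s")
```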

Numerical integration

The curse of dimensionality also plays a major role in numerical integration, since the number of function evaluations required by simple algorithms such as the trapezoidal rule grows exponentially with the number of dimensions; the rate of convergence therefore drops drastically. For simple tasks this problem can be circumvented with the Monte Carlo method, whose convergence rate does not depend on the dimensionality. Otherwise, hierarchical decomposition methods are used.
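The following sketch (illustrative only; the integrand and all parameters are arbitrary choices) contrasts the cost of a tensor-product trapezoidal rule, which needs $n^d$ evaluations, with plain Monte Carlo, whose $O(N^{-1/2})$ error does not depend on the dimension:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Integrand exp(-|x|^2); its exact integral over [0,1]^d factorizes."""
    return math.exp(-np.dot(x, x))

def trapezoid(d, n=5):
    """Tensor-product trapezoidal rule: requires n**d function evaluations."""
    nodes = np.linspace(0.0, 1.0, n)
    w = np.full(n, 1.0 / (n - 1))
    w[0] /= 2
    w[-1] /= 2
    total = 0.0
    for idx in itertools.product(range(n), repeat=d):
        idx = list(idx)
        total += np.prod(w[idx]) * f(nodes[idx])
    return total

def monte_carlo(d, n=10_000):
    """Plain Monte Carlo: error ~ n**-0.5, independent of d."""
    return np.mean([f(x) for x in rng.random((n, d))])

d = 6
exact = (math.sqrt(math.pi) / 2 * math.erf(1.0)) ** d
print(f"exact       : {exact:.5f}")
print(f"trapezoid   : {trapezoid(d):.5f}  ({5**d} evaluations)")
print(f"monte carlo : {monte_carlo(d):.5f}  (10000 evaluations)")
```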

References

  1. K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: When is “Nearest Neighbor” Meaningful? In: Proc. 7th International Conference on Database Theory (ICDT'99). LNCS 1540. Springer, 1999, ISBN 978-3-540-65452-0, p. 217, doi:10.1007/3-540-49257-7_15.
  2. M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In: Proc. 21st International Conference on Scientific and Statistical Database Management (SSDBM). Springer, Heidelberg 2010, doi:10.1007/978-3-642-13818-8_34.
  3. A. Zimek, E. Schubert, H.-P. Kriegel: A survey on unsupervised outlier detection in high-dimensional numerical data. In: Statistical Analysis and Data Mining. Vol. 5, No. 5, 2012, pp. 363–387, doi:10.1002/sam.11161.

Sources

  • Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.