Projection pursuit

Projection pursuit (literally: pursuit of the projection) is a statistical technique for simplifying large amounts of high-dimensional data in such a way that the most "interesting" structures are revealed. For this purpose, a hyperplane (e.g. a plane) is placed in the space spanned by the data, and the data are projected onto it.

Projection pursuit was first published in 1974 by John W. Tukey and Jerome H. Friedman and was disseminated further through the work of Peter J. Huber (around 1985).

The analysis of multivariate data is usually carried out by means of a suitable mapping into lower dimensions. The best-known example is the scatter plot, in which two dimensions form the axes of a coordinate system. Every such projection hides existing structures to a greater or lesser extent, but can never amplify them.

The idea of projection pursuit has been applied to various statistical problems:

  • Exploratory Projection Pursuit to reveal interesting structures in data
  • Projection Pursuit Regression (PPR for short)
  • Projection pursuit density estimation
  • Projection pursuit classification
  • Projection Pursuit Discriminant Analysis

Exploratory Projection Pursuit

Fig. 1: Projection of data points on the corners of a six-dimensional cube (cube6) onto a two-dimensional hyperplane. The data appear approximately standard normally distributed in the plane.
Fig. 2: Solution of the cube6 data set optimized with the "Central Mass" index in GGobi.
Fig. 3: Visualization of the "Central Mass" index function in GGobi.

In Exploratory Projection Pursuit, each hyperplane is assigned a measure (or index) that indicates how interesting the structure contained in it is. P. Diaconis and D. Freedman showed that most projections of high-dimensional data into such hyperplanes resemble normally distributed data (see Fig. 1). Many indices therefore measure the distance between the structure in the hyperplane and a normal distribution.

The projections of the data onto a hyperplane whose dimension is one or more lower than that of the original data are then computed and optimized automatically, one after the other (a minimal search loop is sketched below). If data points are identified as part of an interesting structure, they are removed from the analysis. The procedure is repeated with the reduced data set until no further structure can be recognized.
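
The sketch below assumes only NumPy, uses a simple moment-based measure of non-normality as a stand-in for the indices discussed in the next section, and replaces a real optimizer by plain random search; all function names and parameters are illustrative.

```python
import numpy as np

def sphere(X):
    """Center the data and whiten them so that the covariance matrix becomes the identity."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric inverse square root
    return Xc @ W

def moment_index(z):
    """Crude non-normality measure for one projected coordinate:
    squared skewness plus a quarter of the squared excess kurtosis."""
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 3) ** 2 + 0.25 * (np.mean(z ** 4) - 3.0) ** 2

def plane_index(X, a, b):
    """Index of a two-dimensional projection plane spanned by a and b."""
    return moment_index(X @ a) + moment_index(X @ b)

def random_search(X, n_tries=2000, seed=0):
    """Draw random orthonormal pairs (a, b) and keep the plane with the highest index."""
    rng = np.random.default_rng(seed)
    best_score, best_a, best_b = -np.inf, None, None
    for _ in range(n_tries):
        Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], 2)))
        score = plane_index(X, Q[:, 0], Q[:, 1])
        if score > best_score:
            best_score, best_a, best_b = score, Q[:, 0], Q[:, 1]
    return best_score, best_a, best_b

# Example: noisy data on the corners of a six-dimensional cube (cf. the cube6 data set)
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 6)).astype(float) + 0.05 * rng.standard_normal((500, 6))
score, a, b = random_search(sphere(X))
print("best index value:", round(score, 3))
```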

Indices

The multivariate data x_i are usually also transformed so that the means of the variables are zero and the variance-covariance matrix is the identity matrix (sphering). If a and b are the projection vectors for the hyperplane, z_i = (a^T x_i, b^T x_i) the data projected into the hyperplane, φ the density function of the standard normal distribution (or of the corresponding normal distribution, if unsphered data are used instead) and f the density function of the projected data in the hyperplane, then the following indices, among others, have been proposed; each of them is subsequently maximized:

Friedman-Tukey Index
I_FT(a, b) = ∫ f(z)² dz.
The index is minimized by a parabolic density function, which is very similar to the density function of a standard normal distribution.
Entropy index
I_E(a, b) = ∫ f(z) log f(z) dz.
−I_E is the (differential) entropy, which is maximized by the standard normal distribution; the index itself is therefore likewise minimized by the standard normal distribution.
Legendre Index, Hermite Index and Natural Hermite Index
I_L(a, b) = ∫ (f(z) − φ(z))² / φ(z) dz,
I_H(a, b) = ∫ (f(z) − φ(z))² dz and
I_N(a, b) = ∫ (f(z) − φ(z))² φ(z) dz.
All three indices measure the distance to the standard normal distribution; they differ only in how the difference between the density of the projected data and the standard normal density is weighted.
χ²-Index
I_χ²(a, b) partitions the (two-dimensional) projection plane into 48 cells and then applies a goodness-of-fit test to compare the number of observations in each cell with the number expected under the standard normal distribution.

In principle, any test statistic belonging to a test for normality can be used as an index. Maximizing it then leads to hyperplanes in which the data are not normally distributed. Special versions of these indices are maximized by particular structures, e.g. a central hole ("Central Hole") or a central mass ("Central Mass").

The unknown density function f of the projected data is estimated using either a kernel density estimator or an orthonormal function expansion (see the sketch below).
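
As an illustration, the following sketch estimates the density of a one-dimensional projection with a kernel density estimator and evaluates a Hermite-type index ∫ (f(z) − φ(z))² dz by numerical integration; the function name, grid and bandwidth choices are illustrative and not taken from any of the cited implementations.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def hermite_index(z, grid_size=512):
    """Approximate the one-dimensional Hermite-type index
    integral of (f(z) - phi(z))^2 dz, with f a kernel density estimate
    of the projected data and phi the standard normal density."""
    z = np.asarray(z, dtype=float)
    z = (z - z.mean()) / z.std()           # projected data are assumed to be sphered
    f = gaussian_kde(z)                    # kernel density estimator for f
    grid = np.linspace(-5.0, 5.0, grid_size)
    return np.trapz((f(grid) - norm.pdf(grid)) ** 2, grid)

# A normal projection gives an index near zero, a bimodal one a clearly larger value.
rng = np.random.default_rng(0)
print(hermite_index(rng.standard_normal(2000)))
print(hermite_index(np.concatenate([rng.normal(-2, 0.5, 1000),
                                    rng.normal(2, 0.5, 1000)])))
```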

Related methods

As special cases of Exploratory Projection Pursuit one can regard

  • the Grand Tour, during which the structures are discovered by the viewer in the graphics, and
  • principal component analysis, in which the index is the variance of the projected data, I(a) = Var(a^T x) (see the small check below).
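
The second point can be checked with a small, purely illustrative experiment: maximizing the variance index I(a) = Var(a^T x) over unit vectors recovers (up to sign) the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated three-dimensional data with one dominant direction of variation
X = rng.standard_normal((1000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 1.0, 0.0],
                                               [0.5, 0.2, 0.3]])
Xc = X - X.mean(axis=0)

def variance_index(a):
    """Projection index corresponding to principal component analysis: Var(a^T x)."""
    return np.var(Xc @ a)

# The first principal component (leading eigenvector of the covariance matrix) ...
pc1 = np.linalg.eigh(np.cov(Xc, rowvar=False))[1][:, -1]

# ... is (approximately) recovered by maximizing the index over random unit vectors.
best = max((rng.standard_normal(3) for _ in range(5000)),
           key=lambda a: variance_index(a / np.linalg.norm(a)))
best = best / np.linalg.norm(best)
print(abs(best @ pc1))   # close to 1: both vectors point in (almost) the same direction
```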

Projection Pursuit Regression

In the case of regression, the unknown regression function f(x) is represented iteratively by regression functions g_j on the projected data, f(x) ≈ g_1(a_1^T x) + … + g_M(a_M^T x) (a minimal sketch is given after the list):

  1. r_i^(0) = y_i are the observed response values
  2. In step j, find a direction a_j and a function g_j such that the sum of (r_i^(j−1) − g_j(a_j^T x_i))² over all i is minimal
  3. Set r_i^(j) = r_i^(j−1) − g_j(a_j^T x_i)
  4. Iterate steps 2–3 until the remaining squared error is smaller than a given limit or no longer decreases
  5. Improve the approximation by minimizing again for each a_j and g_j while keeping the other terms fixed (backfitting)
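
A minimal sketch of this stagewise procedure is given below. It assumes a crude random search over the directions a_j, uses low-degree polynomials as the one-dimensional smoothers g_j, and omits the backfitting of step 5; real implementations use proper scatterplot smoothers and numerical optimization. All names and parameters are illustrative.

```python
import numpy as np

def fit_ppr(X, y, n_terms=3, n_dirs=500, degree=5, seed=0):
    """Minimal projection pursuit regression sketch: in each stage, search random
    unit directions a_j, fit a one-dimensional smoother g_j (here: a polynomial)
    to the current residuals against a_j^T x, keep the pair with the smallest
    squared error and update the residuals."""
    rng = np.random.default_rng(seed)
    r = np.asarray(y, dtype=float).copy()        # r^(0) = observed responses
    terms = []
    for _ in range(n_terms):
        best = None
        for _ in range(n_dirs):
            a = rng.standard_normal(X.shape[1])
            a /= np.linalg.norm(a)
            t = X @ a
            coeffs = np.polyfit(t, r, degree)    # g_j as a low-degree polynomial
            sse = np.sum((r - np.polyval(coeffs, t)) ** 2)
            if best is None or sse < best[0]:
                best = (sse, a, coeffs)
        _, a, coeffs = best
        r = r - np.polyval(coeffs, X @ a)        # r^(j) = r^(j-1) - g_j(a_j^T x)
        terms.append((a, coeffs))
    return terms

def predict_ppr(terms, X):
    """Sum of the fitted ridge functions g_j(a_j^T x)."""
    return sum(np.polyval(coeffs, X @ a) for a, coeffs in terms)

# Example: a response that depends on the input only through two projections
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))
y = np.sin(X[:, 0] + X[:, 1]) + (X[:, 2] - X[:, 3]) ** 2 + 0.1 * rng.standard_normal(400)
terms = fit_ppr(X, y)
print("residual standard deviation:", np.std(y - predict_ppr(terms, X)))
```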

Projection pursuit density estimation

An iterative method is also used in the case of density estimation. The unknown density function f(x) is approximated as the product of an initial density and correction functions of the projected data:

f(x) ≈ f^(0)(x) · h_1(a_1^T x) · … · h_M(a_M^T x)

with f^(0)(x) = φ(x; μ̂, Σ̂), the density function of the multivariate normal distribution whose parameters μ̂ and Σ̂ are estimated from the data. This normal density is then corrected step by step by the factors h_j. In contrast to the regression case, however, the algorithm is considerably more complicated, since there are no observed values of the density to which the correction functions could be fitted directly (only the representational form is sketched below).
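
The following sketch only evaluates this multiplicative representation: a multivariate normal density with parameters estimated from the data, multiplied by ridge correction factors h_j(a_j^T x). The correction term shown is purely hypothetical, and the hard part of the actual algorithm, fitting the h_j while keeping the corrected estimate a proper density, is not shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pp_density(x, X, corrections):
    """Evaluate a projection pursuit density estimate of the form
    f(x) = phi(x; mu_hat, Sigma_hat) * prod_j h_j(a_j^T x),
    where phi is the multivariate normal density with parameters estimated
    from the data X and each (a_j, h_j) is a ridge correction term.
    Note: the fitting of the h_j and the renormalization are not shown here."""
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)
    value = multivariate_normal(mu_hat, sigma_hat).pdf(x)
    for a, h in corrections:
        value *= h(np.dot(x, a))
    return value

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
# Purely hypothetical correction term: raise the density far from the centre
# along the first coordinate direction.
corrections = [(np.array([1.0, 0.0, 0.0]), lambda t: 1.0 + 0.5 * np.tanh(t) ** 2)]
print(pp_density(np.zeros(3), X, corrections))
print(pp_density(np.array([2.0, 0.0, 0.0]), X, corrections))
```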

References

  1. J. H. Friedman, J. W. Tukey (Sept. 1974): A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers C-23, No. 9, pp. 881 ff. doi:10.1109/TC.1974.224051. ISSN 0018-9340.
  2. P. J. Huber (1985): Projection pursuit. Annals of Statistics 13, No. 2, pp. 435 ff.
  3. J. H. Friedman (1987): Exploratory projection pursuit. Journal of the American Statistical Association 82, No. 397, pp. 249–266.
  4. J. H. Friedman, W. Stuetzle (1981): Projection pursuit regression. Journal of the American Statistical Association 76, pp. 817–823.
  5. J. H. Friedman, W. Stuetzle, A. Schroeder (1984): Projection pursuit density estimation. Journal of the American Statistical Association 79, pp. 599–608.
  6. J. H. Friedman, W. Stuetzle (1981): Projection pursuit classification. Unpublished manuscript.
  7. J. Polzehl (1995): Projection pursuit discriminant analysis. Computational Statistics & Data Analysis 20, pp. 141–157.
  8. P. Diaconis, D. Freedman (1984): Asymptotics of graphical projection pursuit. The Annals of Statistics 12, No. 3, pp. 793–815.
  9. P. Hall (1989): On polynomial-based projection indices for exploratory projection pursuit. The Annals of Statistics 17, No. 2, pp. 589–605.
  10. D. Cook, A. Buja, J. Cabrera (1993): Projection pursuit indices based on orthonormal function expansion. Journal of Computational and Graphical Statistics 2, No. 3, pp. 225–250.
  11. C. Posse (1995): Projection pursuit exploratory data analysis. Computational Statistics and Data Analysis 20, pp. 669–687.