Projection pursuit

Projection pursuit (literally: pursuit of the projection) is a statistical technique for simplifying large amounts of high-dimensional data in such a way that the most "interesting" structures are revealed. For this purpose, a hyperplane (e.g. a plane) is placed in the space spanned by the data, and the data are projected onto it.

Projection pursuit was first published in 1974 by John W. Tukey and Jerome H. Friedman and was disseminated further through the work of Peter J. Huber (around 1985).

The analysis of multivariate data is usually carried out by means of a suitable mapping into lower dimensions. The best-known example is the scatter plot, in which two dimensions form the axes of a coordinate system. Every such projection hides existing structures to a greater or lesser extent, but can never amplify them.

The idea of projection pursuit has been applied to various statistical problems:

  • Exploratory Projection Pursuit to reveal interesting structures in data
  • Projection Pursuit Regression (PPR for short)
  • Projection pursuit density estimation
  • Projection pursuit classification
  • Projection Pursuit Discriminant Analysis

Exploratory Projection Pursuit

Fig. 1: Projection of data points on the corners of a six-dimensional cube (cube6) onto a two-dimensional hyperplane. The data appear approximately standard normally distributed in the plane.
Fig. 2: Solution of the cube6 data set optimized with the "Central Mass" index in GGobi.
Fig. 3: Visualization of the "Central Mass" index function in GGobi.

In Exploratory Projection Pursuit, each hyperplane is assigned a measure (or index) that indicates how interesting the structure contained in it is. P. Diaconis and D. Freedman showed that most projections of high-dimensional data into such hyperplanes resemble normally distributed data (see Fig. 1). Many indices therefore measure the distance between the structure in the hyperplane and a normal distribution.

The projections of the data onto a hyperplane whose dimension is one or more lower than that of the original data are then computed and optimized automatically, one after the other (a minimal search loop is sketched below). If data points are identified as part of an interesting structure, they are removed from the analysis. The procedure is repeated with the reduced data set until no further structure can be recognized.
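
The sketch below assumes only NumPy, uses a simple moment-based measure of non-normality as a stand-in for the indices discussed in the next section, and replaces a real optimizer by plain random search; all function names and parameters are illustrative.

```python
import numpy as np

def sphere(X):
    """Center the data and whiten them so that the covariance matrix becomes the identity."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric inverse square root
    return Xc @ W

def moment_index(z):
    """Crude non-normality measure for one projected coordinate:
    squared skewness plus a quarter of the squared excess kurtosis."""
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 3) ** 2 + 0.25 * (np.mean(z ** 4) - 3.0) ** 2

def plane_index(X, a, b):
    """Index of a two-dimensional projection plane spanned by a and b."""
    return moment_index(X @ a) + moment_index(X @ b)

def random_search(X, n_tries=2000, seed=0):
    """Draw random orthonormal pairs (a, b) and keep the plane with the highest index."""
    rng = np.random.default_rng(seed)
    best_score, best_a, best_b = -np.inf, None, None
    for _ in range(n_tries):
        Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], 2)))
        score = plane_index(X, Q[:, 0], Q[:, 1])
        if score > best_score:
            best_score, best_a, best_b = score, Q[:, 0], Q[:, 1]
    return best_score, best_a, best_b

# Example: noisy data on the corners of a six-dimensional cube (cf. the cube6 data set)
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 6)).astype(float) + 0.05 * rng.standard_normal((500, 6))
score, a, b = random_search(sphere(X))
print("best index value:", round(score, 3))
```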

Indices

The multivariate data x_i are usually also transformed so that the means of the variables are zero and the variance-covariance matrix is the identity matrix (sphering). If a and b are the projection vectors for the hyperplane, z_i = (a^T x_i, b^T x_i) the data projected into the hyperplane, φ the density function of the standard normal distribution (or of the corresponding normal distribution, if unsphered data are used instead) and f the density function of the projected data in the hyperplane, then the following indices, among others, have been proposed; each of them is subsequently maximized:

Friedman-Tukey Index
I_FT(a, b) = ∫ f(z)² dz.
The index is minimized by a parabolic density function, which is very similar to the density function of a standard normal distribution.
Entropy index
I_E(a, b) = ∫ f(z) log f(z) dz.
−I_E is the (differential) entropy, which is maximized by the standard normal distribution; the index itself is therefore likewise minimized by the standard normal distribution.
Legendre Index, Hermite Index and Natural Hermite Index
I_L(a, b) = ∫ (f(z) − φ(z))² / φ(z) dz,
I_H(a, b) = ∫ (f(z) − φ(z))² dz and
I_N(a, b) = ∫ (f(z) − φ(z))² φ(z) dz.
All three indices measure the distance to the standard normal distribution; they differ only in how the difference between the density of the projected data and the standard normal density is weighted.
χ²-Index
I_χ²(a, b) partitions the (two-dimensional) projection plane into 48 cells and then applies a goodness-of-fit test to compare the number of observations in each cell with the number expected under the standard normal distribution.

In principle, any test statistic belonging to a test for normality can be used as an index. Maximizing it then leads to hyperplanes in which the data are not normally distributed. Special versions of these indices are maximized by particular structures, e.g. a central hole ("Central Hole") or a central mass ("Central Mass").

The unknown density function f of the projected data is estimated using either a kernel density estimator or an orthonormal function expansion (see the sketch below).
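
As an illustration, the following sketch estimates the density of a one-dimensional projection with a kernel density estimator and evaluates a Hermite-type index ∫ (f(z) − φ(z))² dz by numerical integration; the function name, grid and bandwidth choices are illustrative and not taken from any of the cited implementations.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def hermite_index(z, grid_size=512):
    """Approximate the one-dimensional Hermite-type index
    integral of (f(z) - phi(z))^2 dz, with f a kernel density estimate
    of the projected data and phi the standard normal density."""
    z = np.asarray(z, dtype=float)
    z = (z - z.mean()) / z.std()           # projected data are assumed to be sphered
    f = gaussian_kde(z)                    # kernel density estimator for f
    grid = np.linspace(-5.0, 5.0, grid_size)
    return np.trapz((f(grid) - norm.pdf(grid)) ** 2, grid)

# A normal projection gives an index near zero, a bimodal one a clearly larger value.
rng = np.random.default_rng(0)
print(hermite_index(rng.standard_normal(2000)))
print(hermite_index(np.concatenate([rng.normal(-2, 0.5, 1000),
                                    rng.normal(2, 0.5, 1000)])))
```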

Related methods

As special cases of Exploratory Projection Pursuit one can regard

  • the Grand Tour, during which the structures are discovered by the viewer in the graphics, and
  • principal component analysis, in which the index is the variance of the projected data, I(a) = Var(a^T x) (see the small check below).
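
The second point can be checked with a small, purely illustrative experiment: maximizing the variance index I(a) = Var(a^T x) over unit vectors recovers (up to sign) the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated three-dimensional data with one dominant direction of variation
X = rng.standard_normal((1000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 1.0, 0.0],
                                               [0.5, 0.2, 0.3]])
Xc = X - X.mean(axis=0)

def variance_index(a):
    """Projection index corresponding to principal component analysis: Var(a^T x)."""
    return np.var(Xc @ a)

# The first principal component (leading eigenvector of the covariance matrix) ...
pc1 = np.linalg.eigh(np.cov(Xc, rowvar=False))[1][:, -1]

# ... is (approximately) recovered by maximizing the index over random unit vectors.
best = max((rng.standard_normal(3) for _ in range(5000)),
           key=lambda a: variance_index(a / np.linalg.norm(a)))
best = best / np.linalg.norm(best)
print(abs(best @ pc1))   # close to 1: both vectors point in (almost) the same direction
```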

Projection Pursuit Regression

In the case of regression, the unknown regression function f(x) is represented iteratively by regression functions g_j on the projected data, f(x) ≈ g_1(a_1^T x) + … + g_M(a_M^T x) (a minimal sketch is given after the list):

  1. r_i^(0) = y_i are the observed response values
  2. In step j, find a direction a_j and a function g_j such that the sum of (r_i^(j−1) − g_j(a_j^T x_i))² over all i is minimal
  3. Set r_i^(j) = r_i^(j−1) − g_j(a_j^T x_i)
  4. Iterate steps 2–3 until the remaining squared error is smaller than a given limit or no longer decreases
  5. Improve the approximation by minimizing again for each a_j and g_j while keeping the other terms fixed (backfitting)
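
A minimal sketch of this stagewise procedure is given below. It assumes a crude random search over the directions a_j, uses low-degree polynomials as the one-dimensional smoothers g_j, and omits the backfitting of step 5; real implementations use proper scatterplot smoothers and numerical optimization. All names and parameters are illustrative.

```python
import numpy as np

def fit_ppr(X, y, n_terms=3, n_dirs=500, degree=5, seed=0):
    """Minimal projection pursuit regression sketch: in each stage, search random
    unit directions a_j, fit a one-dimensional smoother g_j (here: a polynomial)
    to the current residuals against a_j^T x, keep the pair with the smallest
    squared error and update the residuals."""
    rng = np.random.default_rng(seed)
    r = np.asarray(y, dtype=float).copy()        # r^(0) = observed responses
    terms = []
    for _ in range(n_terms):
        best = None
        for _ in range(n_dirs):
            a = rng.standard_normal(X.shape[1])
            a /= np.linalg.norm(a)
            t = X @ a
            coeffs = np.polyfit(t, r, degree)    # g_j as a low-degree polynomial
            sse = np.sum((r - np.polyval(coeffs, t)) ** 2)
            if best is None or sse < best[0]:
                best = (sse, a, coeffs)
        _, a, coeffs = best
        r = r - np.polyval(coeffs, X @ a)        # r^(j) = r^(j-1) - g_j(a_j^T x)
        terms.append((a, coeffs))
    return terms

def predict_ppr(terms, X):
    """Sum of the fitted ridge functions g_j(a_j^T x)."""
    return sum(np.polyval(coeffs, X @ a) for a, coeffs in terms)

# Example: a response that depends on the input only through two projections
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))
y = np.sin(X[:, 0] + X[:, 1]) + (X[:, 2] - X[:, 3]) ** 2 + 0.1 * rng.standard_normal(400)
terms = fit_ppr(X, y)
print("residual standard deviation:", np.std(y - predict_ppr(terms, X)))
```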

Projection pursuit density estimation

An iterative method is also used in the case of density estimation. The unknown density function f(x) is approximated as the product of an initial density and correction functions of the projected data:

f(x) ≈ f^(0)(x) · h_1(a_1^T x) · … · h_M(a_M^T x)

with f^(0)(x) = φ(x; μ̂, Σ̂), the density function of the multivariate normal distribution whose parameters μ̂ and Σ̂ are estimated from the data. This normal density is then corrected step by step by the factors h_j. In contrast to the regression case, however, the algorithm is considerably more complicated, since there are no observed values of the density to which the correction functions could be fitted directly (only the representational form is sketched below).
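
The following sketch only evaluates this multiplicative representation: a multivariate normal density with parameters estimated from the data, multiplied by ridge correction factors h_j(a_j^T x). The correction term shown is purely hypothetical, and the hard part of the actual algorithm, fitting the h_j while keeping the corrected estimate a proper density, is not shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pp_density(x, X, corrections):
    """Evaluate a projection pursuit density estimate of the form
    f(x) = phi(x; mu_hat, Sigma_hat) * prod_j h_j(a_j^T x),
    where phi is the multivariate normal density with parameters estimated
    from the data X and each (a_j, h_j) is a ridge correction term.
    Note: the fitting of the h_j and the renormalization are not shown here."""
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)
    value = multivariate_normal(mu_hat, sigma_hat).pdf(x)
    for a, h in corrections:
        value *= h(np.dot(x, a))
    return value

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
# Purely hypothetical correction term: raise the density far from the centre
# along the first coordinate direction.
corrections = [(np.array([1.0, 0.0, 0.0]), lambda t: 1.0 + 0.5 * np.tanh(t) ** 2)]
print(pp_density(np.zeros(3), X, corrections))
print(pp_density(np.array([2.0, 0.0, 0.0]), X, corrections))
```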

References

  1. J. H. Friedman, J. W. Tukey (Sept. 1974): A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Transactions on Computers C-23, No. 9, pp. 881 ff. doi:10.1109/TC.1974.224051. ISSN 0018-9340.
  2. P. J. Huber (1985): Projection pursuit. Annals of Statistics 13, No. 2, pp. 435 ff.
  3. J. H. Friedman (1987): Exploratory projection pursuit. Journal of the American Statistical Association 82, No. 397, pp. 249–266.
  4. J. H. Friedman, W. Stuetzle (1981): Projection pursuit regression. Journal of the American Statistical Association 76, pp. 817–823.
  5. J. H. Friedman, W. Stuetzle, A. Schroeder (1984): Projection pursuit density estimation. Journal of the American Statistical Association 79, pp. 599–608.
  6. J. H. Friedman, W. Stuetzle (1981): Projection pursuit classification. Unpublished manuscript.
  7. J. Polzehl (1995): Projection pursuit discriminant analysis. Computational Statistics & Data Analysis 20, pp. 141–157.
  8. P. Diaconis, D. Freedman (1984): Asymptotics of graphical projection pursuit. The Annals of Statistics 12, No. 3, pp. 793–815.
  9. P. Hall (1989): On polynomial-based projection indices for exploratory projection pursuit. The Annals of Statistics 17, No. 2, pp. 589–605.
  10. D. Cook, A. Buja, J. Cabrera (1993): Projection pursuit indices based on orthonormal function expansion. Journal of Computational and Graphical Statistics 2, No. 3, pp. 225–250.
  11. C. Posse (1995): Projection pursuit exploratory data analysis. Computational Statistics and Data Analysis 20, pp. 669–687.