Principal component analysis

Principal component analysis as factor analysis: two principal components of a two-dimensional point cloud (orthogonally rotated)

Principal component analysis (PCA; the underlying mathematical procedure is also known as the principal axis transformation or singular value decomposition) is a method of multivariate statistics. It is used to structure, simplify and illustrate large data sets by approximating a large number of statistical variables with a smaller number of linear combinations (the "principal components") that are as informative as possible. Principal component analysis, also known as the Karhunen-Loève transformation, is used especially in image processing. It is to be distinguished from factor analysis, with which it has formal similarity and in which it can be used as an approximate method for factor extraction (the difference between the two methods is explained in the article Factor analysis).

There are several generalizations of principal component analysis, e.g. the principal curves, the principal surfaces, and kernel-based principal component analysis (kernel PCA).

History

Principal component analysis was introduced by Karl Pearson in 1901 and further developed by Harold Hotelling in the 1930s. Like other statistical analysis methods, it only became widespread with the increasing availability of computers in the third quarter of the 20th century. The first applications came from biology.

Concept of principal component analysis

The underlying data set typically has the structure of a matrix: p characteristics were measured on n test subjects or objects. Such a data set can be visualized as a set of n points in the p-dimensional space $\mathbb{R}^p$. The aim of principal component analysis is to project these data points into a q-dimensional subspace ($q < p$) in such a way that as little information as possible is lost and redundancy present in the form of correlation in the data points is summarized.

Mathematically, a principal axis transformation is carried out: the correlation of multidimensional features is minimized by transforming them into a vector space with a new basis. The principal axis transformation can be specified by an orthogonal matrix that is formed from the eigenvectors of the covariance matrix. The principal component analysis is therefore problem-dependent, because a separate transformation matrix must be calculated for each data set. The rotation of the coordinate system is carried out so that the covariance matrix is diagonalized, i.e. the data are decorrelated (the correlations are the off-diagonal entries of the covariance matrix). For normally distributed data sets this means that the individual components of each data set are statistically independent after the PCA, since the normal distribution is fully characterized by its zeroth (normalization), first (expected value) and second moments (covariances). If the data sets are not normally distributed, the data will still be statistically dependent after the PCA, although now decorrelated. The PCA is therefore an "optimal" method only for normally distributed data sets.
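This decorrelation can be illustrated numerically. The following is a minimal NumPy sketch; the correlated two-dimensional data are synthetic, assumed purely for illustration. The eigenvectors of the covariance matrix form the orthogonal transformation, and after rotating the data the off-diagonal entry of the covariance matrix vanishes:

```python
import numpy as np

# Minimal sketch of the principal axis transformation: rotate the data
# into the eigenbasis of the covariance matrix, which removes the
# correlations (the off-diagonal entries). Synthetic data for illustration.
rng = np.random.default_rng(0)

x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])
data -= data.mean(axis=0)                  # center the data

cov = np.cov(data, rowvar=False)           # off-diagonal entry clearly nonzero
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvectors form an orthogonal matrix

transformed = data @ eigvecs               # rotate into the principal axes
cov_t = np.cov(transformed, rowvar=False)  # now numerically diagonal

print(cov[0, 1])    # nonzero correlation before the transformation
print(cov_t[0, 1])  # practically zero afterwards
```

As stated above, for non-Gaussian data this removes only the correlation, not statistical dependence in general.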

Application example

Artillery ships of the Second World War are considered (see warship data). They are divided into the classes battleships, heavy cruisers, light cruisers and destroyers. Data are available for around 200 ships. The characteristics length, width, water displacement, draft, engine power, speed (maximum possible sustained speed), operating range and crew size were recorded. The characteristics length, width, displacement and draft can be interpreted as all measuring a similar underlying quantity, which one could describe as a "size" factor. The question is whether other factors also determine the data. There is in fact a second clear factor, which is determined mainly by engine power and maximum speed. It could be summarized as a "speed" factor.

Other examples of principal component analysis applications are:

• Applying principal component analysis to consumer buying behavior, there may be latent factors such as social status, age, or marital status that motivate certain purchases. Here, targeted advertising could channel the desire to buy accordingly.
• If you have a statistical model with a large number of features, the number of variables in the model could possibly be reduced with the help of the principal component analysis, which usually increases the model quality.
• Principal component analysis is also used in image processing, especially in remote sensing. Satellite images can be analyzed and conclusions drawn from them.
• Another area is artificial intelligence, together with neural networks. There, PCA is used to separate features in the context of automatic classification or pattern recognition.

Procedure

First principal component of the data (black-dark red line) and the center of the data (thick black point)

Idea

The data are available as a point cloud in a p-dimensional Cartesian coordinate system.

Best linear approximation to the data set

The calculation of the principal components can be seen as an iterative process. In the graphic on the right, the line that best approximates the data points (open circles) is sought. The error of a data point is the Euclidean distance between the straight line and the data point. For the data point at the top right, the error is the red line that is perpendicular to the black straight line. The first principal component is the straight line for which the sum of the squares of these errors is minimal.

A further straight line is then sought that also goes through the mean value of the data points and is orthogonal to the first straight line: the second principal component. In the case of two-dimensional data, this is simply the straight line perpendicular to the first principal component. Otherwise the next principal component is perpendicular to all previous principal components; under this condition, the straight line for which the sum of squares of the distances is minimal is again determined. In this way the further straight lines up to the p-th principal component can be determined.
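This characterization of the first principal component can be checked numerically. The sketch below (synthetic two-dimensional data, assumed for illustration) searches candidate directions by brute force for the line through the data mean with the smallest sum of squared perpendicular distances, and compares the result with the top eigenvector of the covariance matrix:

```python
import numpy as np

# The first principal component as the best-fit line: minimize the sum of
# squared perpendicular distances over candidate directions, then compare
# with the eigenvector of the largest eigenvalue. Synthetic data.
rng = np.random.default_rng(1)
pts = rng.normal(size=(300, 2)) @ np.array([[2.0, 0.6], [0.6, 0.5]])
pts -= pts.mean(axis=0)                     # line passes through the mean

def perp_sq_error(direction):
    # squared perpendicular distance = |point|^2 - (projection onto line)^2
    proj = pts @ direction
    return np.sum(np.sum(pts**2, axis=1) - proj**2)

angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
best = dirs[np.argmin([perp_sq_error(d) for d in dirs])]

w, v = np.linalg.eigh(np.cov(pts, rowvar=False))
top = v[:, -1]                              # direction of maximal variance

print(abs(float(best @ top)))   # close to 1: same direction up to sign
```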

Maximizing the variance

The distance between the center of the data and a data point is independent of which straight line through the center is considered as the "reference" (see the red line from the center of the data to the data point at the top right). Using the Pythagorean theorem $a^2 + b^2 = c^2$, however, the distance can be divided into the part in the direction of the black straight line and a part at right angles to it. Minimizing the distances perpendicular to the straight line (while the distance to the data center, the length of the red line, remains fixed) therefore means maximizing the distances in the direction of the black straight line. The sum of the squares of the distances in the direction of the black line forms the variance of the data in this direction.

This leads to the following algorithm: the first axis should be placed through the point cloud in such a way that the variance of the data in this direction is maximal. The second axis is perpendicular to the first axis; in its direction the variance is the second largest, and so on.
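The axis variances produced by this algorithm are exactly the eigenvalues of the covariance matrix, and they sum to the total variance. A short NumPy sketch with synthetic three-dimensional data (assumed for illustration):

```python
import numpy as np

# Sketch: the variances along the principal axes are the eigenvalues of
# the covariance matrix, and their sum equals the total variance
# (the trace of the covariance matrix). Synthetic data for illustration.
rng = np.random.default_rng(2)
data = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))
data -= data.mean(axis=0)

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order

scores = data @ eigvecs                    # coordinates in the new system
axis_vars = scores.var(axis=0, ddof=1)     # variance along each new axis

print(np.allclose(axis_vars, eigvals))           # axis variances = eigenvalues
print(np.isclose(eigvals.sum(), np.trace(cov)))  # total variance preserved
```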

For p-dimensional data there are in principle p axes that are perpendicular to each other; they are orthogonal. The total variance of the data is the sum of these "axis variances". With the p axes, a new coordinate system is now placed in the point cloud. The new coordinate system can be represented as a rotation of the variable axes.

If a sufficiently large percentage of the total variance is covered by the first r' ($r' < p$) axes, the principal components represented by the new axes appear sufficient for the information content of the data. The total variance of the data is therefore a measure of their information content.

Often the principal components cannot be interpreted in terms of content. In statistics it is said that no comprehensible hypothesis can be attributed to them (see factor analysis).

Statistical model

One considers p random variables $X_j$ that are centered on their expected values, that is, their expected values have been subtracted. These random variables are combined in a p-dimensional random vector $\mathbf{X}$. Its expected value vector is the zero vector, and its $(p \times p)$ covariance matrix $\mathbf{\Sigma}$ is symmetric and positive semidefinite. Its eigenvalues $\lambda_j$, $j = 1, \dots, p$, are sorted in descending order of size and listed as the diagonal elements of the diagonal matrix $\mathbf{\Lambda}$. The eigenvectors belonging to them form the orthogonal matrix $\mathbf{\Gamma}$. It then holds that

$\mathbf{\Lambda} = \mathbf{\Gamma}^T \mathbf{\Sigma} \mathbf{\Gamma}.$

If the random vector $\mathbf{X}$ is linearly transformed to $\mathbf{Y} = \mathbf{\Gamma}^T \mathbf{X}$, then the covariance matrix of $\mathbf{Y}$ is just the diagonal matrix $\mathbf{\Lambda}$.

To clarify, we consider a three-dimensional random vector

$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}.$

The matrix $\mathbf{\Lambda}$ of the eigenvalues of the covariance matrix $\mathbf{\Sigma}$ of $\mathbf{X}$ is

$\mathbf{\Lambda} = \begin{pmatrix} \lambda_A & 0 & 0 \\ 0 & \lambda_B & 0 \\ 0 & 0 & \lambda_C \end{pmatrix},$

where $\lambda_A \geq \lambda_B \geq \lambda_C$.

The normalized $(3 \times 1)$ eigenvectors $\boldsymbol{\gamma}_j$ can be collected as the columns of the matrix $\mathbf{\Gamma}$:

$\mathbf{\Gamma} = \begin{pmatrix} \boldsymbol{\gamma}_A & \boldsymbol{\gamma}_B & \boldsymbol{\gamma}_C \end{pmatrix} = \begin{pmatrix} \gamma_{1A} & \gamma_{1B} & \gamma_{1C} \\ \gamma_{2A} & \gamma_{2B} & \gamma_{2C} \\ \gamma_{3A} & \gamma_{3B} & \gamma_{3C} \end{pmatrix}.$
The transformation $\mathbf{X} \rightarrow \mathbf{Y} = \mathbf{\Gamma}^T \mathbf{X}$

gives the equations

$Y_A = \gamma_{1A} X_1 + \gamma_{2A} X_2 + \gamma_{3A} X_3$
$Y_B = \gamma_{1B} X_1 + \gamma_{2B} X_2 + \gamma_{3B} X_3$
$Y_C = \gamma_{1C} X_1 + \gamma_{2C} X_2 + \gamma_{3C} X_3.$

The variance of $Y_A$ is

$\operatorname{Var}(Y_A) = \lambda_A.$

The principal component $Y_A$ thus accounts for the largest share of the total variance of the data, $Y_B$ for the second largest share, and so on. The elements $\gamma_{jk}$, $j = 1, 2, 3$; $k = A, B, C$, can be interpreted as the contribution of the variable $X_j$ to the factor $k$. In this context, the matrix $\mathbf{\Gamma}$ is called the loading matrix; it indicates "how heavily a variable $X_j$ loads on a factor $Y_k$".

Estimation of the model parameters

If concrete data with p features have been collected (i.e. each data point is a p-dimensional vector), the sample correlation matrix is calculated from the feature values. The eigenvalues and eigenvectors for the principal component analysis are then determined from this matrix. Since the covariance matrix is a symmetric $p \times p$ matrix, $(p^2 + p)/2$ parameters must be estimated for its calculation. This only makes sense if the number N of data points in the data set is significantly larger, i.e. if $N \gg (p^2 + p)/2$. Otherwise the determination of the covariance matrix is highly error-prone and this method should not be used.
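The parameter count can be made concrete in a few lines. The safety factor of 10 below is an assumption chosen purely for illustration; the text only requires N to be "significantly larger":

```python
# Rule of thumb from the text: a symmetric p x p covariance matrix has
# (p^2 + p) / 2 free parameters to estimate.
def covariance_parameters(p: int) -> int:
    return (p**2 + p) // 2

# "N >> (p^2 + p) / 2" made concrete with an assumed safety factor of 10.
def enough_data(n_points: int, p: int, factor: int = 10) -> bool:
    return n_points > factor * covariance_parameters(p)

print(covariance_parameters(3))   # 6 parameters for three features
print(covariance_parameters(8))   # 36 parameters for eight features
print(enough_data(5000, 8))       # True: 5000 points suffice comfortably
```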

Example with three variables

The above application example is now illustrated with numbers:

We consider the variables length, width and speed. The scatter plots give an impression of the joint distribution of the variables.

A principal component analysis of these three variables was carried out using a statistical program. The loading matrix $\mathbf{\Gamma}$ is

Factor     A        B        C
Length     0.862    0.481    −0.159
Width      0.977    0.083    0.198
Speed      −0.679   0.730    0.082

The factor $Y_A$ is thus made up as

$Y_A = 0.862 \cdot \text{length} + 0.977 \cdot \text{width} - 0.679 \cdot \text{speed}.$

In particular, the contributions of length and width to the first factor are large. For the second factor, the contribution made by speed is particularly great. The third factor is unclear and probably irrelevant.

The total variance of the data is distributed among the principal components as follows:

Factor   Eigenvalue $\lambda_j$   Percent of total variance   Cumulative percent of total variance
A        2.16                     71.97                       71.97
B        0.77                     25.67                       97.64
C        0.07                     2.36                        100.00

The first two principal components already cover 97.64% of the total variance of the data. The third factor contributes nothing worth mentioning to the information content.
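The percentages in the table follow directly from the eigenvalues: each share is $\lambda_j$ divided by the sum of all eigenvalues. The sketch below recomputes them from the rounded eigenvalues listed above, so the results agree with the table only up to rounding:

```python
import numpy as np

# Variance shares from the eigenvalues: share_j = lambda_j / sum(lambda).
# The eigenvalues are the rounded values from the table above, so the
# recomputed percentages can differ from the table in the last digit.
eigvals = np.array([2.16, 0.77, 0.07])   # factors A, B, C

share = 100 * eigvals / eigvals.sum()    # percent of total variance
cumulative = np.cumsum(share)            # cumulative percent

print(share)        # roughly 72, 25.7 and 2.3 percent
print(cumulative)   # the last entry is exactly 100
```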

Example with eight variables

Eight characteristics of the artillery ships were now subjected to a principal component analysis. The table of the loading matrix, here called the "component matrix", shows that the variables length, width, draft, water displacement and crew strength load particularly on the first principal component. This component could be called "size". The second component is largely explained by horsepower and knots. It could be called "speed". A third component loads mainly on the radius of action.

The first two factors already cover approximately 84% of the information in the ship data, and the third factor approximately another 10%. The additional contribution of the remaining components is insignificant.

Component matrix
component
1 2 3 4 5 6 7 8
Water displacement BRT 0.948 −0.094 −0.129 0.228 0.040 0.036 0.136 0.055
Length m 0.906 0.302 −0.064 −0.209 0.128 −0.144 −0.007 −0.050
Width m 0.977 −0.128 −0.031 0.032 0.103 −0.017 −0.014 0.129
Draft m 0.934 −0.276 −0.061 0.014 0.074 0.129 0.154 −0.038
1000 hp 0.552 0.779 −0.196 −0.133 −0.099 0.143 −0.038 0.018
Knot sm / h −0.520 0.798 −0.157 0.222 0.109 −0.038 0.071 0.004
Action radius 100 nm 0.398 0.311 0.862 0.038 0.008 0.022 −0.002 −0.005
Team strength 0.955 0.063 −0.052 0.108 −0.226 −0.121 0.067 0.002
Extraction method: principal component analysis
Eight components extracted
Variance of the components
component Eigenvalues
Total % of the variance Cumulative
1 5.19 64.88 64.88
2 1.54 19.22 84.10
3 0.83 10.43 94.53
4 0.18 2.22 96.74
5 0.11 1.34 98.08
6 0.08 0.95 99.03
7 0.05 0.67 99.70
8 0.02 0.30 100.00

Application in cluster analysis and dimension reduction

Two-dimensional example of a PCA. The two clusters have little internal variation. The first principal component will be x_1, the second x_2. The major portion of the total dispersion lies between the clusters (signal variance).
Two-dimensional example of a PCA. The two clusters have a very large internal spread. The first principal component will be x_2, the second x_1. The major portion of the total dispersion lies within the clusters (noise variance).

Principal component analysis (PCA) is also often used in cluster analysis and to reduce the dimension of the parameter space, especially when one has no idea (model) of the structure of the data. It makes use of the fact that the PCA rotates the (orthogonal) coordinate system in such a way that the covariance matrix is diagonalized. In addition, the PCA reorders the coordinate axes (the principal components) so that the first principal component contains the largest share of the total variance in the data set, the second principal component the second largest share, and so on. As illustrated by the examples in the previous section, the rear principal components (i.e. those that contain only a small proportion of the total variance) can usually be dropped without replacement and without any significant loss of information.
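Dropping the rear principal components amounts to projecting the centered data onto the first q eigenvectors. A minimal sketch with synthetic five-dimensional data that has an assumed two-dimensional structure:

```python
import numpy as np

# Dimension reduction with PCA: project centered data onto the first q
# principal axes and measure how much of the total variance survives.
# Synthetic data: a 2-D latent structure embedded in 5 dimensions.
rng = np.random.default_rng(3)
latent = rng.normal(size=(500, 2))              # 2 "true" directions
data = latent @ rng.normal(size=(2, 5))         # embedded in 5-D
data += 0.05 * rng.normal(size=data.shape)      # small noise
data -= data.mean(axis=0)

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]               # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

q = 2
reduced = data @ eigvecs[:, :q]                 # 5-D data -> 2-D scores
explained = eigvals[:q].sum() / eigvals.sum()

print(reduced.shape)   # (500, 2)
print(explained)       # close to 1: hardly any information lost
```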

The basic assumption for the use of the PCA for cluster analysis and dimension reduction is: The directions with the greatest dispersion (variance) contain the most information.

In this context it is very important that this basic assumption is only a working hypothesis , which does not always have to be true. To illustrate this, two examples follow:

• Signal variance: The chart on the right titled PCA signal variance shows an example in which the assumption holds. The data set consists of two clusters (red and green) that are clearly separated from each other. The spread of the data points within each cluster is very small compared to the "distance" between the two clusters. The first principal component will accordingly be x_1. It can also be clearly seen that the first principal component x_1 is completely sufficient to separate the two clusters from one another, while the second principal component x_2 contains no useful information about this. The number of dimensions can therefore be reduced from 2 to 1 (by neglecting x_2) without losing essential information about the two clusters. The total variance of the data set is thus dominated by the signal (two separate clusters).
• Noise variance: The chart on the right titled PCA noise variance shows an example in which the assumption does not hold and the PCA cannot be used for dimension reduction. The spread within the two clusters is now significantly larger and accounts for the main part of the total spread. Assuming that this scattering within the clusters is caused by noise, this case is called noise variance. The first principal component will be x_2, which contains no information about the separability of the two clusters.

These two examples show how the PCA can be used for dimension reduction and cluster analysis, and that this is not always possible. Whether the basic assumption that the directions of greatest variation are really the most interesting holds depends on the given data set and often cannot be checked, especially when the number of dimensions is very high and the data therefore cannot be fully visualized.
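The two cases can also be simulated directly. In the sketch below (synthetic clusters, assumed for illustration) the clusters are separated along the x_1 axis; with small within-cluster spread the first principal component points along x_1, while with a large spread along x_2 it points along x_2 instead:

```python
import numpy as np

# Simulation of signal variance vs. noise variance with two synthetic
# clusters separated along the x_1 axis.
rng = np.random.default_rng(4)

def first_pc(points):
    centered = points - points.mean(axis=0)
    w, v = np.linalg.eigh(np.cov(centered, rowvar=False))
    return v[:, -1]                  # eigenvector of the largest eigenvalue

centers = np.repeat([[-5.0, 0.0], [5.0, 0.0]], 200, axis=0)

# Signal variance: tight clusters, first PC ~ separating direction x_1.
tight = centers + rng.normal(scale=[0.5, 0.5], size=(400, 2))
# Noise variance: large spread along x_2, first PC ~ noise direction x_2.
noisy = centers + rng.normal(scale=[0.5, 20.0], size=(400, 2))

print(np.abs(first_pc(tight)))   # first coordinate dominates (direction x_1)
print(np.abs(first_pc(noisy)))   # second coordinate dominates (direction x_2)
```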

Connection with multidimensional scaling

Both multidimensional scaling and principal component analysis condense the data. If Euclidean distances are used in (metric) multidimensional scaling and if the dimension of the configuration is equal to the number of principal components, then both methods provide the same solution. This is because the diagonalization of the covariance matrix (or correlation matrix, if standardized data are used) corresponds to a rotation of the coordinate system in principal component analysis. As a result, the distances between the observations, which form the starting point in multidimensional scaling, remain the same.

However, other distances can also be used in multi-dimensional scaling; In this respect, the principal component analysis can be viewed as a special case of multidimensional scaling.
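This equivalence can be verified numerically: double-centering the matrix of squared Euclidean distances yields the Gram matrix of the centered data, whose eigendecomposition reproduces the PCA scores up to the sign of each axis. A sketch with synthetic data, assumed for illustration:

```python
import numpy as np

# Classical (metric) MDS on Euclidean distances vs. PCA scores.
rng = np.random.default_rng(5)
n = 50
data = rng.normal(size=(n, 3))
data -= data.mean(axis=0)

# PCA scores: project onto the eigenvectors of the covariance matrix.
w, v = np.linalg.eigh(np.cov(data, rowvar=False))
pca_scores = data @ v[:, ::-1]                     # descending eigenvalue order

# Classical MDS: double-center the squared distance matrix.
d2 = np.sum((data[:, None, :] - data[None, :, :])**2, axis=-1)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ d2 @ J                              # Gram matrix of centered data
lam, u = np.linalg.eigh(B)
mds = u[:, ::-1][:, :3] * np.sqrt(lam[::-1][:3])   # MDS configuration

# Identical configuration up to the sign of each axis:
print(np.allclose(np.abs(mds), np.abs(pca_scores), atol=1e-6))
```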

Literature

Original work

• Karl Pearson: On lines and planes of closest fit to a system of points in space. In: The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. Series 6, 2, 1901, pp. 559–572, doi:10.1080/14786440109462720.

Textbooks

• GH Dunteman: Principal Component Analysis . Sage Publications, 1989
• L. Fahrmeir, A. Hamerle, G. Tutz (Ed.): Multivariate statistical methods . New York 1996
• A. Handl, T. Kuhlenkasper: Multivariate Analysis Methods. Theory and Practice with R. 3rd edition. Springer, Berlin 2017, ISBN 978-3-662-54753-3 .
• J. Hartung, B. Elpelt: Multivariate Statistics . Munich / Vienna 1999
• T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction . 2001
• W. Kessler: Multivariate data analysis . Weinheim 2007 (An introduction to PCA with a sample CD)
• WJ Krzanowski: Principles of Multivariate Analysis . Rev. ed. Oxford University Press, Oxford 2000
• KV Mardia, JT Kent, JM Bibby: Multivariate Analysis . New York 1979