Principal component analysis
The principal component analysis (short: HKA , English Principal Component Analysis , short: PCA ; the mathematical method is also known as the principal axis transformation or singular value decomposition ) is a method of multivariate statistics . It is used to structure, simplify and illustrate large data sets by approximating a large number of statistical variables with a smaller number of linear combinations (the "main components") that are as meaningful as possible . Principal component analysis, also known as Karhunen - Loève transformation , is used especially in image processing . It is to be distinguished from factor analysis, with which it has formal similarity and in which it can be used as an approximate method for factor extraction (the difference between the two methods is explained in the article Factor Analysis).
There are several generalizations of principal component analysis, e.g. B. the principal curves , the principal surfaces or the core-based principal component analysis ( kernel principal component analysis , short: kernel PCA ).
history
Principal component analysis was introduced by Karl Pearson in 1901 and further developed by Harold Hotelling in the 1930s . Like other statistical analysis methods, it only became widespread with the increasing availability of computers in the third quarter of the 20th century. The first applications came from biology.
Conception of the principal component analysis
The underlying data set typically has the structure of a matrix: characteristics were measured on test subjects or objects . Such a data set can be visualized as a set of points in -dimensional space . The aim of the principal component analysis is to project these data points into a -dimensional subspace ( ) in such a way that as little information as possible is lost and the redundancy present is summarized in the form of correlation in the data points.
Mathematically, a principal axis transformation is carried out: The correlation of multidimensional features is minimized by converting them into a vector space with a new basis . The principal axis transformation can be specified by an orthogonal matrix that is formed from the eigenvectors of the covariance matrix . The principal component analysis is therefore problem-dependent because a separate transformation matrix must be calculated for each data set. The rotation of the coordinate system is carried out so that the covariance matrix is diagonalized, i.e. H. the data are decorrelated (the correlations are the off-diagonal entries of the covariance matrix). For normally distributed data sets, this means that the individual components of each data set are statistically independent of one another according to the PCA, since the normal distribution is fully characterized by the zeroth (normalization), first ( expected value ) and second moment ( covariances ). If the data sets are not normally distributed, the data will still be statistically dependent after the PCA - although now decorrelated. The PCA is therefore only an “optimal” method for normally distributed data sets.
Application example
Artillery ships of the Second World War are considered (see warship data ). They are divided into the classes battleships , heavy cruisers , light cruisers and destroyers . Data are available for around 200 ships. The characteristics of length, width, water displacement , draft , engine power , speed (maximum possible long-term speed), operating range and number of crews were recorded. The characteristics of length, breadth, displacement and draft can be interpreted in such a way that they all measure a similar set of facts, which one could describe as the factor “size”. The question is whether other factors determine the data. There is actually a second clear factor, which is mainly determined by the performance of the machines and the maximum speed. It could be summed up as a “speed” factor.
Other examples of principal component analysis applications are:
- Applying principal component analysis to consumer buying behavior, there may be latent factors such as social status, age, or marital status that motivate certain purchases. Here, targeted advertising could channel the desire to buy accordingly.
- If you have a statistical model with a large number of features, the number of variables in the model could possibly be reduced with the help of the principal component analysis, which usually increases the model quality.
- Principal component analysis is also used in image processing - especially in remote sensing . You can analyze satellite images and draw conclusions from them.
- Another area is artificial intelligence , together with neural networks . There, the PCA is used to separate features within the framework of the automatic classification or in the pattern recognition .
Procedure
idea
The data is available as a point cloud in a -dimensional Cartesian coordinate system .
Best linear approximation to the data set
The calculation of the main components can be seen as an iterative process. In the right graphic, the line that best approximates the data is searched for for the data points (open circles). The error of a data point is the Euclidean distance between the straight line and the data points. For the data point at the top right, the error is the red line that is perpendicular to the black straight line. The first main component is the straight line for which the sum of the squares of these errors is minimal.
A further straight line is then sought that also goes through the mean value of the data points and is orthogonal to the first straight line: the second main component. In the case of two-dimensional data, this is simply the straight line perpendicular to the first main component. Otherwise the next main component is perpendicular to all previous main components; With this condition, the straight line is determined again in which the sum of squares of the distances is minimal. In this way the further straight lines up to the -th main component can be determined.
Maximizing the variance
The distance between the center of the data and a data point is independent of which straight line through the center is considered as the “reference” (see the red line from the center of the data to the data point at the top right). Using the Pythagorean theorem, however, we can divide the distance into the part in the direction of the black straight line and another part at right angles to it. Minimizing the distances perpendicular to the straight line (while maintaining the distance to the data center, length of the red line) means maximizing the distances in the direction of the black straight line ( must be retained). The sum of the squares of the distances in the direction of the black line form the variance of the data in this direction.
This leads to the following algorithm: The first axis should be placed through the point cloud in such a way that the variance of the data in this direction is maximal. The second axis is perpendicular to the first axis. In their direction, the variance is the second largest, etc.
For the -dimensional data there are basically axes that are perpendicular to each other, they are orthogonal . The total variance of the data is the sum of these "axis variances". With the axes, a new coordinate system is now placed in the point cloud. The new coordinate system can be represented as a rotation of the variable axes.
If a sufficiently large percentage of the total variance is now covered by the first ( ) axes, the main components that are represented by the new axes appear to be sufficient for the information content of the data. The total variance of the data is therefore a measure of their information content .
Often the main components cannot be interpreted in terms of content. In statistics it is said that no understandable hypothesis can be attributed to them (see factor analysis ).
Statistical model
One looks at random variables that are centered on their expected values . That is, their expected values have been subtracted from the random variable. These random variables are combined in a -dimensional random vector . This has as an expected value vector to the zero vector and the - covariance matrix , the symmetric and positive semi 's. The eigenvalues , the matrix are sorted in descending order according to size. They are listed as diagonal elements in the diagonal matrix . The eigenvectors belonging to them form the orthogonal matrix . It then applies
If the random vector is linearly transformed to , then the covariance matrix of is just the diagonal matrix .
To clarify, we consider a three-dimensional random vector
- .
The matrix of the eigenvalues of the covariance matrix of is
where is.
The normalized eigenvectors can be summarized as columns of the matrix :
- .
The matrix-vector multiplication
gives the equations
- .
The variance of is
So has the main component for the largest share of the total variance of the data, the second largest share, etc. The elements , ; , could be called the contribution of the variable to the factor . In this context, the matrix is called the charge matrix , it indicates "how high a variable loads on a factor ".
Estimation of the model parameters
If there are specific collected data with features (i.e. each data point is a -dimensional vector ), the sample correlation matrix is calculated from the feature values. The eigenvalues and eigenvectors for the principal component analysis are then determined from this matrix. Since the covariance matrix is a symmetric matrix, parameters must be estimated for its calculation . This only makes sense if the number of data points in the data set is significantly larger, i.e. H. if . Otherwise the determination of the covariance matrix is highly error-prone and this method should not be used.
Example with three variables
The above application example is now illustrated in numbers:
We consider the variables length, latitude and speed. The scatter diagrams give an impression of the common distribution of the variables.
Principal component analysis was performed on these three variables using a statistical program. The charge matrix is
factor | A. | B. | C. |
length | 0.862 | 0.481 | −0.159 |
width | 0.977 | 0.083 | 0.198 |
speed | −0.679 | 0.730 | 0.082 |
So the factor is made up of
- .
In particular, the contributions of length and width to the first factor are large. For the second factor, the contribution made by speed is particularly great. The third factor is unclear and probably irrelevant.
The total variance of the data is distributed among the main components as follows:
factor | Eigenvalue | Percent of total variance |
Percentage of the cumulative variance in the total variance |
A. | 2.16 | 71.97 | 71.97 |
B. | 0.77 | 25.67 | 97.64 |
C. | 0.07 | 2.36 | 100.00 |
The first two main components already cover 97.64% of the total variance of the data. The third factor does not add anything worth mentioning to the information content.
Example with eight variables
Eight characteristics of the artillery ships have now been subjected to a principal component analysis. The table of the cargo matrix, here called the “component matrix”, shows that the variables length, breadth, draft, water displacement and crew strength particularly load onto the first main component. This component could be called “size”. The second component is largely explained by PS and knots. It could be called "speed". A third component uploads to the radius of action.
The first two factors already cover approx. 84% of the information in the ship's data, the third factor covers approx. 10% again. The additional contribution of the remaining components is insignificant.
Component matrix | ||||||||
---|---|---|---|---|---|---|---|---|
component | ||||||||
1 | 2 | 3 | 4th | 5 | 6th | 7th | 8th | |
Water displacement BRT | 0.948 | −0.094 | −0.129 | 0.228 | 0.040 | 0.036 | 0.136 | 0.055 |
Length m | 0.906 | 0.302 | −0.064 | −0.209 | 0.128 | −0.144 | −0.007 | −0.050 |
Width m | 0.977 | −0.128 | −0.031 | 0.032 | 0.103 | −0.017 | −0.014 | 0.129 |
Draft m | 0.934 | −0.276 | −0.061 | 0.014 | 0.074 | 0.129 | 0.154 | −0.038 |
1000 hp | 0.552 | 0.779 | -0.196 | −0.133 | −0.099 | 0.143 | −0.038 | 0.018 |
Knot sm / h | −0.520 | 0.798 | −0.157 | 0.222 | 0.109 | −0.038 | 0.071 | 0.004 |
Action radius 100 nm | 0.398 | 0.311 | 0.862 | 0.038 | 0.008 | 0.022 | −0.002 | −0.005 |
Team strength | 0.955 | 0.063 | −0.052 | 0.108 | −0.226 | −0.121 | 0.067 | 0.002 |
Extraction method: principal component analysis | ||||||||
Eight components extracted |
Variance of the components | |||
---|---|---|---|
component | Eigenvalues | ||
Total | % of the variance | Cumulative | |
1 | 5.19 | 64.88 | 64.88 |
2 | 1.54 | 19.22 | 84.10 |
3 | 0.83 | 10.43 | 94.53 |
4th | 0.18 | 2.22 | 96.74 |
5 | 0.11 | 1.34 | 98.08 |
6th | 0.08 | 0.95 | 99.03 |
7th | 0.05 | 0.67 | 99.70 |
8th | 0.02 | 0.30 | 100.00 |
Application in cluster analysis and dimension reduction
Principal component analysis (PCA) is also often used in cluster analysis and to reduce the dimension of the parameter space, especially when one has no idea (model) of the structure of the data. This makes use of the fact that the PCA rotates the (orthogonal) coordinate system in such a way that the covariance matrix is diagonalized. In addition, the PCA rearranges the order of the coordinate axes (the main components) so that the first main component contains the largest share of the total variance in the data set, the second main component the second largest share, etc. As was illustrated by the examples in the previous section, you can usually delete the rear main components (i.e. those that only contain a small proportion of the total variation) without replacing them without causing any significant loss of information.
The basic assumption for the use of the PCA for cluster analysis and dimension reduction is: The directions with the greatest dispersion (variance) contain the most information.
In this context it is very important that this basic assumption is only a working hypothesis , which does not always have to be true. To illustrate this, two examples follow:
- Signal Variance ( German signal variance ): The chart on the right titled PCA signal Variance shows an example in which the assumption is correct. The data set consists of two clusters (red and green) that are clearly separated from each other. The spread of the data points within each cluster is very small compared to the “distance” between the two clusters. The first main component will accordingly be x_1. It can also be clearly seen that the first main component x_1 is completely sufficient to separate the two clusters from one another, while the second main component x_2 does not contain any useful information about this. The number of dimensions can therefore be reduced from 2 to 1 (by neglecting x_2) without losing essential information about the two clusters. The total variance of the data set is thus dominated by the signal (two separate clusters).
- Noise Variance ( German noise variance ): The chart on the right titled PCA Noise Variance shows an example in which the assumption is not correct and the PCA can not be used for dimension reduction. The spread within the two clusters is now significantly larger and accounts for the main part of the total spread. Assuming that this scattering within the cluster is caused by noise, this case is called noise variance . The first main component will be x_2, which does not contain any information about the separability of the two clusters.
These two examples show how the PCA can be used to reduce the dimension and for cluster analysis or that this is not always possible. Whether or not the basic assumption that the directions of the greatest variation are really the most interesting is true or not depends on the given data set and can often not be checked - especially when the number of dimensions is very high and the data therefore does not match let more fully visualize.
Connection with multidimensional scaling
Both the multidimensional scaling and the principal component analysis condense the data. If Euclidean distances are used in (metric) multidimensional scaling and if the dimension of the configuration is equal to the number of main components, then both methods provide the same solution. This is because the diagonalization of the covariance matrix (or correlation matrix , if standardized data is used) corresponds to a rotation of the coordinate system in the principal component analysis. As a result, the distances between the observations, which form the starting point in the multidimensional scaling, remain the same.
However, other distances can also be used in multi-dimensional scaling; In this respect, the principal component analysis can be viewed as a special case of multidimensional scaling.
See also
literature
Original work
- Karl Pearson: On lines and planes of closest fit to a system of points in space . In: The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. Series 6, 2, 1901, pp. 559-572, doi : 10.1080 / 14786440109462720 .
Textbooks
- GH Dunteman: Principal Component Analysis . Sage Publications, 1989
- L. Fahrmeir, A. Hamerle, G. Tutz (Ed.): Multivariate statistical methods . New York 1996
- A. Handl, T. Kuhlenkasper: Multivariate Analysis Methods. Theory and Practice with R. 3rd edition. Springer, Berlin 2017, ISBN 978-3-662-54753-3 .
- J. Hartung, B. Elpelt: Multivariate Statistics . Munich / Vienna 1999
- T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction . 2001
- W. Kessler: Multivariate data analysis . Weinheim 2007 (An introduction to PCA with a sample CD)
- WJ Krzanowski: Principles of Multivariate Analysis . Rev. ed. Oxford University Press, Oxford 2000
- KV Mardia, JT Kent, JM Bibby: Multivariate Analysis . New York 1979
Web links
- Lindsay I Smith: A tutorial on Principal Components Analysis. (PDF; 117 kB)