Canonical correlation

Canonical correlation is a measure of the mutual dependence between two sets of (random) variables. Canonical correlation analysis is the method of multivariate statistics used to analyze this relationship. It was introduced by Harold Hotelling in 1935.

Aims

Structure discovery

Since it was developed primarily as an instrument for exploratory statistics, canonical correlation analysis mainly serves to uncover interesting structures in the data, here interesting relationships between sets of variables in a given data set. In contrast to the simple Bravais-Pearson correlation coefficient, the interest lies not in the dependence between two individual variables but in the dependence between two sets of variables.

Dimension reduction

Another application of canonical correlation analysis is to reduce the dimension of the data set under investigation by using the canonical variables with the highest correlation in place of the original variables from which they are formed. It is important that the canonical variables can be interpreted well and as unambiguously as possible; otherwise, replacing the original variables leads to interpretation problems.

Procedure

Given are two sets of random variables, $X_1, \ldots, X_p$ and $Y_1, \ldots, Y_q$.

The goal of (linear) canonical correlation analysis is to find suitable canonical variables, i.e. suitable linear combinations of the variables within each set. The canonical correlation coefficient is then determined from the canonical variables; it indicates the degree of mutual linear dependence between the canonical variables and thus between the two sets of random variables.

The linear combinations considered are

$U = a_1 X_1 + \ldots + a_p X_p$

and

$V = b_1 Y_1 + \ldots + b_q Y_q$.

We are looking for those weight vectors $a = (a_1, \ldots, a_p)$ and $b = (b_1, \ldots, b_q)$ that maximize the correlation between $U$ and $V$.
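
Written out in terms of covariance matrices, the maximization can be stated as follows. The notation is an assumption made here for illustration and is not taken from the text: $X$ and $Y$ denote the two sets stacked as random vectors, $a$ and $b$ the weight vectors, $\Sigma_{XX}$ and $\Sigma_{YY}$ the covariance matrices within each set, and $\Sigma_{XY}$ the covariance matrix between the sets.

```latex
\rho_1
  = \max_{a,\,b} \operatorname{Corr}\!\left(a^\top X,\; b^\top Y\right)
  = \max_{a,\,b}
    \frac{a^\top \Sigma_{XY}\, b}
         {\sqrt{a^\top \Sigma_{XX}\, a}\;\sqrt{b^\top \Sigma_{YY}\, b}}
```

Because the quotient does not change when $a$ or $b$ is rescaled, the weight vectors are usually normalized so that $a^\top \Sigma_{XX}\, a = b^\top \Sigma_{YY}\, b = 1$, which gives both linear combinations unit variance.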

Pairs of factors are extracted successively; each pair is uncorrelated with the preceding pairs, and the correlation within a pair decreases from one pair to the next. The goal is to explain as much of the covariance as possible (similar to principal component analysis, which aims to successively explain as much variance as possible). The correlation between the first pair of factors, i.e. the pair with the highest correlation, is the first canonical correlation. Overall, at most as many pairs of factors can be extracted as there are variables in the smaller of the two sets.
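
The extraction of successive pairs can be illustrated with a minimal NumPy sketch. It assumes one common computational route (whitening both sets with their sample covariance matrices and taking a singular value decomposition), which is not necessarily how any particular statistics package proceeds. The function names `inv_sqrt` and `canonical_correlations` and the simulated data are hypothetical and chosen only for this example.

```python
import numpy as np

def inv_sqrt(m):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_correlations(x, y):
    """Return (rho, a, b) for data matrices x (n x p) and y (n x q).

    rho holds the canonical correlations in decreasing order; the columns
    of a and b are the weight vectors of the corresponding pairs.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Sample covariance matrices within and between the two sets.
    sxx = xc.T @ xc / (n - 1)
    syy = yc.T @ yc / (n - 1)
    sxy = xc.T @ yc / (n - 1)
    # Whiten both sets; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    wx, wy = inv_sqrt(sxx), inv_sqrt(syy)
    u, rho, vt = np.linalg.svd(wx @ sxy @ wy, full_matrices=False)
    return rho, wx @ u, wy @ vt.T

# Simulated data (assumption): both sets share one latent signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x = z + rng.normal(size=(500, 3))   # first set, 3 variables
y = z + rng.normal(size=(500, 2))   # second set, 2 variables

rho, a, b = canonical_correlations(x, y)
print(rho)  # two canonical correlations, largest first
```

With three variables in one set and two in the other, the sketch returns two pairs, matching the number of variables in the smaller set. The columns of `x @ a` and `y @ b` are the canonical variables of the sample.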

Key measures

Various parameters can be calculated to assess the solution.

Redundancy measures

Redundancy measures indicate how superfluous (redundant) a survey or a set of variables is if the observations from the second set of variables are known. In other words, redundancy measures indicate how much variance of one set of variables is explained by the other set of variables.
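
The text does not name a specific redundancy measure. As an illustration, the following sketch assumes one common definition, the Stewart-Love redundancy index: for each canonical pair, the mean squared correlation between the variables of one set and their canonical variable, multiplied by the squared canonical correlation. The function names and the simulated data are hypothetical.

```python
import numpy as np

def inv_sqrt(m):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def redundancy_of_y_given_x(x, y):
    """Redundancy of set y given set x, one value per canonical pair."""
    p = x.shape[1]
    # Work with correlation matrices, i.e. standardized variables.
    r = np.corrcoef(np.hstack([x, y]), rowvar=False)
    rxx, rxy, ryy = r[:p, :p], r[:p, p:], r[p:, p:]
    wx, wy = inv_sqrt(rxx), inv_sqrt(ryy)
    _, rho, vt = np.linalg.svd(wx @ rxy @ wy, full_matrices=False)
    b = wy @ vt.T                  # weights for the standardized y variables
    loadings = ryy @ b             # correlation of each y variable with each canonical variable
    shared = (loadings ** 2).mean(axis=0)  # mean squared loading per canonical variable
    return shared * rho ** 2

# Simulated data (assumption): both sets share one latent signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x = z + rng.normal(size=(500, 3))
y = z + rng.normal(size=(500, 2))

red = redundancy_of_y_given_x(x, y)
print(red, red.sum())  # per-pair redundancy and total redundancy
```

Under this definition, the total over all pairs equals the average squared multiple correlation obtained when each variable of the second set is regressed on the first set, which matches the reading of redundancy as explained variance.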

Properties

The canonical correlation coefficient takes values in the interval [0, 1].

Relationship to other methods

Many other multivariate methods are special cases of canonical correlation analysis or are closely related to it.

If one of the variable sets consists of only a single variable, the canonical correlation coefficient corresponds to the multiple correlation coefficient. If both sets consist of only one variable, the canonical correlation coefficient and the absolute value of the simple (Bravais-Pearson) correlation coefficient are identical.
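
Both special cases can be checked numerically. The sketch below assumes a QR-based computation of the first canonical correlation (the largest singular value of the product of orthonormal bases of the two centered data matrices); the function name `first_canonical_correlation` and the simulated data are hypothetical.

```python
import numpy as np

def first_canonical_correlation(x, y):
    """Largest canonical correlation between the columns of x and of y."""
    # Orthonormal bases of the centered column spaces; the singular values
    # of qx^T qy are the canonical correlations.
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    return np.linalg.svd(qx.T @ qy, compute_uv=False)[0]

# Simulated data (assumption): y depends on the first two columns of x.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = (x @ np.array([1.0, -1.0, 0.0]) + rng.normal(size=200)).reshape(-1, 1)

# Case 1: one set has a single variable -> multiple correlation coefficient,
# computed here as the correlation between y and its least-squares fit.
design = np.column_stack([np.ones(len(x)), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
multiple_r = np.corrcoef((design @ coef).ravel(), y.ravel())[0, 1]
cc_multiple = first_canonical_correlation(x, y)

# Case 2: both sets have a single variable -> |Pearson r|.
# The slope on this column is negative, so r is expected to be negative.
pearson_r = np.corrcoef(x[:, 1], y.ravel())[0, 1]
cc_single = first_canonical_correlation(x[:, [1]], y)

print(cc_multiple, multiple_r)
print(cc_single, abs(pearson_r))
```

Each printed pair should agree up to floating-point error. The second pair shows why the absolute value is needed: the canonical correlation is never negative, because the sign can always be absorbed into one of the weight vectors.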

The model of canonical correlation analysis can be viewed as a path model with two latent variables and their respective indicator sets X and Y.

If the direction of the relationship between the sets of variables is known from theoretical considerations, a multivariate linear regression can be used, i.e. a regression analysis with multiple dependent variables.

Factor analysis, discriminant analysis, analysis of variance and many other multivariate methods are also closely related to canonical correlation analysis.

Application

Canonical correlation analysis is used, for example, in the analysis of latent variables that are operationalized by several measurable variables. An example is measuring the relationship between the results of a personality test and those of an achievement test.

Procedures for canonical correlation analysis are built into many statistical programs, for example in GNU R through the function cancor() from the stats package.
