Measure of connection
A measure of connection, or association measure, in statistics indicates the strength and possibly the direction of a relationship between two statistical variables.
General
Depending on the situation, one or more measures of association may be applicable, for example:
- depending on the scale level of the features or random variables: categorical (nominal, ordinal) or metric, and
- depending on whether a standardized or a non-standardized measure is wanted.
Non-standardized measures of association are only comparable across tables of the same dimensions and/or with the same sample size. These measures usually take the value zero if there is no dependency between the characteristics under consideration. Standardized measures of association take values in a fixed interval, which also makes it possible to assess the strength of the relationship.
Standardized measures of association in which at least one characteristic is nominally scaled usually only take values in the interval $[0, 1]$. If both features are at least ordinally scaled, the standardized measures take values in the interval $[-1, 1]$ (case 1) or $[0, 1]$ (case 2). In the first case, the sign gives the direction of the relationship in addition to its strength.
The second case also includes the error reduction measures. These assume that a predicted value can be calculated for the dependent variable in two ways: once with knowledge of the relationship (a value or category of the dependent variable is predicted depending on the value or category of the independent variable) and once without it (the prediction is based only on the values or categories of the dependent variable). The reduction in prediction error between the two methods is then considered, which indirectly quantifies the relationship between the variables. This also leads to asymmetrical measures, whose value depends on which of the two variables is treated as the dependent one. Asymmetrical here means that the value of the coefficient changes if one considers the observation series $(y_i, x_i)$ instead of the observation series $(x_i, y_i)$.
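The error reduction idea can be sketched in Python using Goodman and Kruskal's Lambda as one instance. This is a minimal illustration, not a reference implementation: the function names, the example table, and the convention of returning zero when there is no error to reduce are assumptions made here.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable.

    `table` is a list of rows of joint frequencies
    (rows: independent variable, columns: dependent variable).
    """
    n = sum(sum(row) for row in table)
    # Without knowledge of the relationship: always predict the modal column.
    column_totals = [sum(col) for col in zip(*table)]
    errors_without = n - max(column_totals)
    # With knowledge: within each row, predict that row's modal column.
    errors_with = n - sum(max(row) for row in table)
    if errors_without == 0:
        return 0.0  # assumption of this sketch: nothing to reduce
    return (errors_without - errors_with) / errors_without


def transpose(table):
    """Swap the roles of the two variables."""
    return [list(col) for col in zip(*table)]


# Hypothetical 2x3 table of joint frequencies, for illustration only.
example = [[30, 10, 10],
           [5, 25, 20]]

lambda_cols_given_rows = goodman_kruskal_lambda(example)
lambda_rows_given_cols = goodman_kruskal_lambda(transpose(example))
print(lambda_cols_given_rows, lambda_rows_given_cols)
```

For this table the two directions give different values (20/65 and 0.5), which is the asymmetry described above.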
Coefficients
For two nominal variables
The coefficients for two nominally scaled variables are based on a contingency table of the joint frequencies (or joint probabilities for random variables). To measure the relationship directly, the quadratic contingency is used, which compares the observed joint frequencies with the joint frequencies expected under independence (= no relationship). If the two frequencies differ for one or more combinations of characteristic values, there is a relationship. There are also special coefficients for 2x2 contingency tables. Measures of association for nominal variables can also be used for ordinal or discrete metric features. However, some of the information in the data, such as the ordering of the values, is then not used.
Coefficient | Range of values | Comment |
---|---|---|
Quadratic contingency | greater than or equal to zero | non-standardized, symmetrical |
Mean square contingency | greater than or equal to zero | standardized for 2x2 contingency tables, symmetrical |
Contingency coefficient | greater than or equal to zero and less than one | non-standardized, symmetrical |
Corrected contingency coefficient | in the interval $[0, 1]$ | standardized, symmetrical |
Cramér's V | in the interval $[0, 1]$ | standardized, symmetrical |
Phi coefficient | in the interval $[0, 1]$ | standardized, symmetrical, special case of Cramér's V for 2x2 contingency tables |
Odds ratio | greater than or equal to zero | non-standardized, asymmetrical, mostly for 2x2 contingency tables |
Goodman and Kruskal's Lambda | in the interval $[0, 1]$ | standardized, symmetrical and asymmetrical, error reduction measure |
Goodman and Kruskal's Tau | in the interval $[0, 1]$ | standardized, symmetrical and asymmetrical, error reduction measure |
Uncertainty coefficient | in the interval $[0, 1]$ | standardized, symmetrical and asymmetrical, error reduction measure |
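
The comparison of observed and expected joint frequencies can be sketched as follows. This is a minimal Python illustration that assumes every row and column total is positive. The function names and example tables are chosen here for illustration. The formulas used are the usual chi-square form of the quadratic contingency, the contingency coefficient $\sqrt{\chi^2 / (\chi^2 + n)}$ and Cramér's V $\sqrt{\chi^2 / (n (k - 1))}$, with $k$ the smaller table dimension.

```python
import math


def quadratic_contingency(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Joint frequency expected if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def contingency_coefficient(table):
    n = sum(sum(row) for row in table)
    chi2 = quadratic_contingency(table)
    return math.sqrt(chi2 / (chi2 + n))


def cramers_v(table):
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))  # smaller table dimension
    return math.sqrt(quadratic_contingency(table) / (n * (k - 1)))


# Hypothetical tables, for illustration only.
independent = [[10, 20], [20, 40]]  # rows are proportional
perfect = [[50, 0], [0, 50]]        # each row determines the column

print(quadratic_contingency(independent), cramers_v(independent))
print(quadratic_contingency(perfect), contingency_coefficient(perfect), cramers_v(perfect))
```

For the perfectly associated 2x2 table, Cramér's V reaches 1 while the uncorrected contingency coefficient stays below 1, which is why the table lists the latter as non-standardized.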
For two ordinal variables
For coefficients for two ordinally scaled variables, the number of observation pairs $(x_i, y_i)$ and $(x_j, y_j)$ is determined that are concordant ($x_i < x_j$ and $y_i < y_j$) or discordant ($x_i < x_j$ and $y_i > y_j$). Concordant pairs suggest a positive relationship, i.e. small values of $x$ occur together with small values of $y$, and large values of $x$ with large values of $y$. Discordant pairs suggest a negative relationship, i.e. small values of $x$ occur together with large values of $y$, and large values of $x$ with small values of $y$. A measure of association is then calculated from the numbers of concordant and discordant pairs. The individual coefficients differ in how they treat ties, i.e. observation pairs with $x_i = x_j$ and/or $y_i = y_j$.
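Counting concordant and discordant pairs can be sketched in Python as below. The sketch assumes the ordinal categories are coded as numbers that preserve their order. The function names and sample data are illustrative, and the brute-force loop over all pairs is chosen for clarity, not speed.

```python
from itertools import combinations


def count_pairs(x, y):
    """Return (concordant, discordant, tied) counts over all observation pairs."""
    concordant = discordant = tied = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xi - xj) * (yi - yj)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
        else:
            tied += 1         # tie in x, in y, or in both
    return concordant, discordant, tied


def kendall_tau_a(x, y):
    c, d, t = count_pairs(x, y)
    return (c - d) / (c + d + t)   # denominator: all n(n-1)/2 pairs


def goodman_kruskal_gamma(x, y):
    c, d, _ = count_pairs(x, y)
    return (c - d) / (c + d)       # tied pairs are dropped entirely


# Hypothetical ordinal codes, for illustration only.
x_ties = [1, 1, 2, 2, 3]
y_ties = [1, 2, 2, 3, 3]

print(count_pairs(x_ties, y_ties))
print(kendall_tau_a(x_ties, y_ties), goodman_kruskal_gamma(x_ties, y_ties))
```

In this example four of the ten pairs are tied, so gamma (1.0) is noticeably larger than Tau a (0.6), illustrating how the treatment of ties separates the coefficients.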
An alternative is to use ranks. Each observation value $x_i$ is assigned a rank that indicates its position in the sorted series of $x$ values. The same is done with the $y_i$ values. Then, for each observation, the rank of $x_i$ is compared with the rank of $y_i$. The more closely the ranks match in an observation, the more it speaks for a positive relationship. The more the ranks differ in an observation, the more it speaks for a negative relationship.
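The rank-based approach can be sketched as follows. The `ranks` helper assigns average ranks to tied values, which is one common convention and an assumption here. The coefficient function uses the well-known rank-difference form of Spearman's coefficient, which is exact only when there are no ties. All names and data are illustrative.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


def spearman_no_ties(x, y):
    """Spearman's coefficient from squared rank differences (no ties assumed)."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y_similar = [1, 3, 2, 5, 4]   # ranks mostly agree
y_reversed = [5, 4, 3, 2, 1]  # ranks are exactly opposite

print(ranks([10, 30, 20]), ranks([5, 5, 7]))
print(spearman_no_ties(x, y_similar), spearman_no_ties(x, y_reversed))
```

Closely matching ranks give a value near +1, and fully reversed ranks give -1.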
Measures of association for ordinal variables can also be used for metric features. In this case, too, some of the information in the data is not used; on the other hand, these coefficients are then robust against outliers and also indicate non-linear relationships.
Coefficient | Range of values | Comment |
---|---|---|
Covariance for ranks | in a symmetric interval around zero that depends on the number of observations | non-standardized, symmetrical, difference of concordant and discordant pairs |
Kendall's Tau a | in the interval $[-1, 1]$ | standardized, symmetrical, does not consider ties |
Kendall's Tau b | in the interval $[-1, 1]$ | standardized, symmetrical, does not consider pairs of observations with $x_i = x_j$ and $y_i = y_j$, does not reach the values $-1$ and $+1$ on non-square tables |
Kendall's Tau c | in the interval $[-1, 1]$ | standardized, symmetrical, does not consider ties, but corrects for non-square tables |
Kendall's Tau | in the interval $[-1, 1]$ | standardized, symmetrical, does not consider pairs of observations with $x_i = x_j$ and $y_i = y_j$ |
Goodman and Kruskal's gamma | in the interval $[-1, 1]$ | standardized, symmetrical, gives values that are too high when there are ties, the absolute value is an error reduction measure |
Yule's Q | in the interval $[-1, 1]$ | standardized, symmetrical, special case of Goodman and Kruskal's gamma for dichotomous variables, can also be used for nominal variables |
Spearman's rank correlation coefficient | in the interval $[-1, 1]$ | standardized, symmetrical, implicitly requires that adjacent ranks always have the same distance |
For two metric variables
In the case of coefficients for two metrically scaled variables, the distance from $x_i$ to an average of the $x$ values and the distance from $y_i$ to an average of the $y$ values are determined for each observation $(x_i, y_i)$. The product of the two distances is then calculated for each observation and averaged over all observations. Positive values of the product speak for a positive relationship, negative values for a negative one. The graphic shows this for the covariance of an observation series: for each observation, the distances to the means are determined, multiplied, and the products averaged. The coefficients differ in how the distance is calculated and which average is used (arithmetic mean or median).
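The scheme of averaging products of distances can be sketched in Python as follows. The sketch makes two assumptions: the covariance divides by $n$ (many texts use $n - 1$ instead), and the quadrant correlation is taken as the average product of the signs of the deviations from the medians. Function names and data are illustrative.

```python
import math
from statistics import mean, median


def covariance(x, y):
    """Average product of the deviations from the arithmetic means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


def pearson(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def sign(value):
    return (value > 0) - (value < 0)


def quadrant_correlation(x, y):
    """Average product of the signs of the deviations from the medians."""
    mx, my = median(x), median(y)
    return sum(sign(xi - mx) * sign(yi - my) for xi, yi in zip(x, y)) / len(x)


# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]
y_outlier = [2, 4, 6, 8, -100]  # last value replaced by an outlier

print(covariance(x, y_linear), pearson(x, y_linear), quadrant_correlation(x, y_linear))
print(pearson(x, y_outlier), quadrant_correlation(x, y_outlier))
```

With the single outlier, the Bravais-Pearson correlation turns negative while the quadrant correlation stays positive, which illustrates what "robust" means in the table of coefficients.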
Spearman's rank correlation coefficient also follows this scheme: instead of $x_i$ and $y_i$, the ranks of $x_i$ and $y_i$ are used in the Bravais-Pearson correlation. Using the properties of ranks, for example that the ranks of $n$ untied observations sum to $n(n+1)/2$, the Bravais-Pearson correlation formula can be simplified.
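One way this simplification can go is sketched below, under the assumption that there are no ties, so that each rank series is a permutation of $1, \dots, n$. The symbols $R(\cdot)$ for ranks, $d_i$ for rank differences and $r_s$ for the coefficient are notation chosen here.

```latex
\begin{align*}
\bar{R} &= \frac{1}{n}\sum_{i=1}^{n} i = \frac{n+1}{2},
\qquad
S = \sum_{i=1}^{n}\bigl(R(x_i)-\bar{R}\bigr)^2
  = \sum_{i=1}^{n}\bigl(R(y_i)-\bar{R}\bigr)^2
  = \frac{n(n^2-1)}{12},\\[4pt]
\sum_{i=1}^{n} d_i^2
 &= \sum_{i=1}^{n}\bigl(R(x_i)-R(y_i)\bigr)^2
  = 2S - 2\sum_{i=1}^{n}\bigl(R(x_i)-\bar{R}\bigr)\bigl(R(y_i)-\bar{R}\bigr),\\[4pt]
r_s &= \frac{\sum_{i=1}^{n}\bigl(R(x_i)-\bar{R}\bigr)\bigl(R(y_i)-\bar{R}\bigr)}{S}
  = 1 - \frac{\sum_{i=1}^{n} d_i^2}{2S}
  = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}.
\end{align*}
```

With ties, the sums of squares of the two rank series are no longer equal to $n(n^2-1)/12$, so the shortened form is then only an approximation.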
Coefficient | Range of values | Comment |
---|---|---|
Covariance | in the interval $(-\infty, +\infty)$ | non-standardized, symmetrical, not robust, only measures the linear relationship |
Bravais-Pearson correlation | in the interval $[-1, 1]$ | standardized, symmetrical, not robust, only measures the linear relationship |
Quadrant correlation | in the interval $[-1, 1]$ | standardized, symmetrical, robust, also measures non-linear relationships |
Coefficient of determination | in the interval $[0, 1]$ | standardized, symmetrical, not robust, error reduction measure |
For two variables of different scale levels
A frequently used option is to apply a coefficient that is suitable for two variables of the lower scale level. If, for example, one variable is ordinal and the other is metric, a coefficient for two ordinal variables is used. One accepts that not all information in the observations is used.
This becomes very problematic when one variable is metric (continuous) and the other is nominal. Therefore, a number of special coefficients have been developed for variables of different scale levels. The roles of the variables cannot be swapped in the formulas, i.e. it makes no sense to speak of symmetric or asymmetric coefficients.
Coefficient | Scale of first variable | Scale of second variable | Range of values | Comment |
---|---|---|---|---|
Eta squared | nominal | metric | in the interval $[0, 1]$ | error reduction measure, not robust |
Point-biserial correlation | dichotomous | metric | in the interval $[-1, 1]$ | not robust |
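
Both coefficients can be sketched in Python as below. Eta squared is written here as the share of the total sum of squares that is removed by predicting each value with its group mean instead of the overall mean. The point-biserial correlation is computed as a Bravais-Pearson correlation with the dichotomous variable coded 0/1. The function names, the 0/1 coding and the data are assumptions for illustration.

```python
import math
from statistics import mean


def eta_squared(groups, values):
    """Share of the total sum of squares explained by the group means."""
    grand_mean = mean(values)
    # Prediction error without knowledge of the nominal variable.
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    # Prediction error when each value is predicted by its group mean.
    within_ss = sum(
        sum((v - mean(vs)) ** 2 for v in vs) for vs in by_group.values()
    )
    return (total_ss - within_ss) / total_ss


def point_biserial(binary, values):
    """Pearson correlation with the dichotomous variable coded as 0/1."""
    mb, mv = mean(binary), mean(values)
    sxy = sum((b - mb) * (v - mv) for b, v in zip(binary, values))
    sxx = sum((b - mb) ** 2 for b in binary)
    syy = sum((v - mv) ** 2 for v in values)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
groups = ["a", "a", "b", "b"]
coded = [0, 0, 1, 1]            # the same grouping coded 0/1
values = [1.0, 3.0, 5.0, 7.0]

eta2 = eta_squared(groups, values)
r_pb = point_biserial(coded, values)
print(eta2, r_pb, r_pb ** 2)
```

With only two groups, the squared point-biserial correlation coincides with Eta squared, as the example shows.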