Measure of association


In statistics, a measure of association (also association measure) indicates the strength and possibly the direction of a relationship between two statistical variables.

General

Depending on the prerequisites, in particular the scale level of the variables, there are one or more possible measures of association, for example for two nominal, two ordinal or two metric variables, or for two variables of different scale levels.

Non-standardized measures of association are those that are only comparable for tables of the same dimensions and/or with the same sample size. These measures usually take the value zero if there is no dependency between the characteristics under consideration. Standardized measures of association take values in a fixed interval, for example [0; 1] or [−1; +1]; this also allows the strength of the relationship to be assessed.

Standardized measures of association in which at least one characteristic is nominally scaled usually take values only in the interval [0; 1]. If both characteristics are at least ordinally scaled, the standardized measures of association take values in the interval [−1; +1] (case 1) or [0; 1] (case 2). In the first case, the direction of the relationship is indicated in addition to its strength.

The second case also includes the error reduction measures. Here it is assumed that a predicted value can be calculated for the dependent variable: once with knowledge of the relationship (depending on the value or category of the independent variable, a certain value or category of the dependent variable is predicted) and once without knowledge of the relationship (only on the basis of the values or categories of the dependent variable). The reduction of the prediction error achieved by the first method compared with the second is then considered; it indirectly quantifies the relationship between the variables. This also leads to asymmetrical measures, depending on which of the two variables is treated as the dependent variable. Asymmetrical here means that the value of the coefficient changes if the observation series (y_i, x_i) is considered instead of the observation series (x_i, y_i).
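
A minimal sketch of this idea, using Goodman and Kruskal's lambda as one example of an asymmetric error reduction measure; the table layout (rows = independent variable, columns = dependent variable), the function name and the example frequencies are illustrative assumptions, not part of the article.

```python
import numpy as np

def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable Y from the row variable X."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # prediction error without knowledge of X: always predict the modal Y category
    e1 = n - table.sum(axis=0).max()
    # prediction error with knowledge of X: predict the modal Y category per row of X
    e2 = (table.sum(axis=1) - table.max(axis=1)).sum()
    # proportional reduction of the prediction error
    return (e1 - e2) / e1

table = [[30, 10],
         [ 5, 25]]
print(goodman_kruskal_lambda(table))                 # Y predicted from X
print(goodman_kruskal_lambda(np.transpose(table)))   # X predicted from Y: different value (asymmetry)
```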

Coefficients

For two nominal variables

The coefficients for two nominally scaled variables are based on a contingency table with the joint frequencies (or, for random variables, the joint probabilities). For a direct measurement of the relationship, the quadratic contingency is used, which compares the observed joint frequencies with the joint frequencies expected under independence (= no relationship). If the two frequencies differ for one or more combinations of characteristic values, there is a relationship. There are also special coefficients for 2×2 contingency tables. Measures of association for nominal variables can also be used for ordinal or metrically discrete characteristics; however, some of the information in the data, e.g. the ordering of the categories, is then not used.
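
A minimal sketch of this comparison of observed and expected joint frequencies, here computing the quadratic contingency (chi-square) and, from it, Cramér's V; the function name and the example table are illustrative assumptions.

```python
import numpy as np

def chi_square_and_cramers_v(table):
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # expected joint frequencies under independence
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    # quadratic contingency: squared deviations relative to the expected frequencies
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1            # smaller table dimension minus one
    cramers_v = np.sqrt(chi2 / (n * k)) # standardized to [0; 1]
    return chi2, cramers_v

table = [[20, 30],
         [40, 10]]
print(chi_square_and_cramers_v(table))
```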

Coefficient | Range of values | Comment
Quadratic contingency | ≥ 0 | non-standardized, symmetrical
Mean square contingency | ≥ 0 | standardized for 2×2 contingency tables, symmetrical
Contingency coefficient | [0; 1) | non-standardized, symmetrical
Corrected contingency coefficient | [0; 1] | standardized, symmetrical
Cramér's V | [0; 1] | standardized, symmetrical
Phi coefficient | [0; 1] | standardized, symmetrical, special case of Cramér's V for 2×2 contingency tables
Odds ratio | ≥ 0 | non-standardized, asymmetrical, mostly for 2×2 contingency tables
Goodman and Kruskal's lambda | [0; 1] | standardized, symmetrical and asymmetrical variants, error reduction measure
Goodman and Kruskal's tau | [0; 1] | standardized, symmetrical and asymmetrical variants, error reduction measure
Uncertainty coefficient | [0; 1] | standardized, symmetrical and asymmetrical variants, error reduction measure

For two ordinal variables

For coefficients for two ordinally scaled variables, the number of pairs of observations is determined that are concordant (x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j) or discordant (x_i < x_j and y_i > y_j, or x_i > x_j and y_i < y_j). Concordant pairs tend to indicate a positive relationship, i.e. small values of x occur together with small values of y and large values of x with large values of y in the observations. Discordant pairs tend to indicate a negative relationship, i.e. small values of x occur together with large values of y and large values of x with small values of y. A measure of association is then calculated from the numbers of concordant and discordant pairs. The individual coefficients differ in how ties, i.e. pairs of observations with x_i = x_j and/or y_i = y_j, are taken into account.
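
A minimal sketch of this counting of concordant and discordant pairs, here combined into Kendall's Tau a; the function name and the example observation series are illustrative assumptions.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            concordant += 1    # both variables change in the same direction
        elif prod < 0:
            discordant += 1    # the variables change in opposite directions
        # prod == 0: tied pair, not counted in the numerator of Tau a
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_tau_a(x, y))   # 0.6: mostly concordant pairs, positive relationship
```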

An alternative is to use ranks. Each x value is assigned a rank that indicates its position in the sorted series of x values; the same is done with the y values. Then, for each observation, the rank of x_i is compared with the rank of y_i. The more the two ranks agree across the observations, the more this indicates a positive relationship; the more they differ, the more this indicates a negative relationship.
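
A minimal sketch of this rank assignment, assuming no tied values; the helper name and the example data are illustrative assumptions.

```python
def ranks(values):
    # rank 1 for the smallest value (ties are not handled in this sketch)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [10.0, 30.0, 20.0, 50.0]
y = [ 2.1,  2.5,  2.4,  3.0]
print(ranks(x))   # [1, 3, 2, 4]
print(ranks(y))   # [1, 3, 2, 4] -> the ranks agree for every observation: positive relationship
```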

Measures of association for ordinal variables can also be used for metric features. In this case, too, some of the information in the data is not used; on the other hand, these coefficients are then robust against outliers and also indicate non-linear relationships.

Coefficient | Range of values | Comment
Covariance of the ranks | [−n(n−1)/2; +n(n−1)/2] | non-standardized, symmetrical, difference of concordant and discordant pairs
Kendall's Tau a | [−1; +1] | standardized, symmetrical, does not take ties into account
Kendall's Tau b | [−1; +1] | standardized, symmetrical, does not consider pairs of observations tied in both variables, does not reach the values −1 and +1 for non-square tables
Kendall's Tau c | [−1; +1] | standardized, symmetrical, does not take ties into account, but corrects for non-square tables
Kendall's Tau | [−1; +1] | standardized, symmetrical, does not consider tied pairs of observations
Goodman and Kruskal's gamma | [−1; +1] | standardized, symmetrical, yields values that are too high in the presence of ties, its absolute value is an error reduction measure
Yule's Q | [−1; +1] | standardized, symmetrical, special case of Goodman and Kruskal's gamma for dichotomous variables, can also be used for nominal variables
Spearman's rank correlation coefficient | [−1; +1] | standardized, symmetrical, implicitly assumes that adjacent ranks are always equidistant

For two metric variables

Figure: Construction of the covariance.

For coefficients for two metrically scaled variables, for each observation (x_i, y_i) the distance of x_i from a mean of the x values and the distance of y_i from a mean of the y values are determined. Then, for each observation, the product of the two distances is calculated and averaged over all observations. Positive values of the averaged product indicate a positive relationship, negative values a negative relationship. The figure illustrates this for the covariance of an observation series: for each observation, the distance to the respective mean is determined, the two distances are multiplied and the products are averaged. The coefficients differ in how the distance is calculated and which mean value is used (arithmetic mean or median).
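
A minimal sketch of this construction for the covariance, using the arithmetic mean; the function name and the example observation series are illustrative assumptions.

```python
def covariance(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # for each observation: product of the deviations from the two means
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    # average of the products over all observations
    return sum(products) / len(products)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(covariance(x, y))   # positive value -> positive relationship
```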

Spearman's rank correlation coefficient also follows this scheme: instead of x_i and y_i, their ranks are used in the Bravais-Pearson correlation. Because of the properties of the ranks (for example, the mean of the ranks is always (n+1)/2), the Bravais-Pearson correlation formula can be simplified.
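
A minimal sketch of this relationship, computing Spearman's coefficient as the Bravais-Pearson correlation of the ranks (assuming no ties); the function name and the example data are illustrative assumptions.

```python
import numpy as np

def spearman_rho(x, y):
    rank_x = np.argsort(np.argsort(x)) + 1    # ranks 1..n (assumes no ties)
    rank_y = np.argsort(np.argsort(y)) + 1
    return np.corrcoef(rank_x, rank_y)[0, 1]  # Pearson correlation of the ranks

x = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [2.0, 3.0, 5.0,  8.0, 13.0]
print(spearman_rho(x, y))   # 1.0: perfectly monotone (though non-linear) relationship
```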

Coefficient | Range of values | Comment
Covariance | (−∞; +∞) | non-standardized, symmetrical, not robust, only measures the linear relationship
Bravais-Pearson correlation | [−1; +1] | standardized, symmetrical, not robust, only measures the linear relationship
Quadrant correlation | [−1; +1] | standardized, symmetrical, robust, also measures non-linear relationships
Coefficient of determination | [0; 1] | standardized, symmetrical, not robust, error reduction measure

For two variables of different scale levels

An option that is often used is to apply a coefficient that is suitable for two variables of the lower scale level. If, for example, one variable is ordinally and the other metrically scaled, a coefficient for two ordinal variables is used. One accepts that not all of the information in the observations is used.

This becomes very problematic when one variable is metric (continuous) and the other is nominal. Therefore, a number of special coefficients have been developed for different scale levels. In their formulas, the roles of the two variables cannot be interchanged, i.e. it does not make sense to speak of symmetric or asymmetric coefficients.

Coefficient | Scale levels | Range of values | Comment
Eta squared | nominal / metric | [0; 1] | error reduction measure, not robust
Point-biserial correlation | dichotomous / metric | [−1; +1] | not robust
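
As an illustration of such a special coefficient for mixed scale levels, a minimal sketch of eta squared for one nominal and one metric variable: the proportion of the total variation of the metric variable explained by the group means of the nominal variable. The function name, the group labels and the values are illustrative assumptions.

```python
import numpy as np

def eta_squared(groups, values):
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()      # total variation
    ss_between = 0.0
    for g in set(groups.tolist()):
        members = values[groups == g]
        ss_between += len(members) * (members.mean() - grand_mean) ** 2
    return ss_between / ss_total                        # in [0; 1]

groups = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
print(eta_squared(groups, values))   # close to 1: the groups explain most of the variation
```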