Measure of connection

A measure of connection, or measure of association, in statistics indicates the strength and possibly the direction of a relationship between two statistical variables.

General

Depending on the prerequisites, one or more measures of association are possible, for example

depending on the scale level of the features or random variables: categorical (nominal, ordinal) or metric, and
depending on whether a standardized or a non-standardized measure is wanted.

Non-standardized measures of association are comparable only for tables of the same dimensions and/or with the same sample size. These measures usually take the value zero if there is no dependency between the characteristics under consideration. Standardized measures of association take values in a fixed interval, so their value can also be used to assess the strength of the relationship.

Standardized measures of association in which at least one characteristic is nominally scaled usually take values only in the interval $[0, 1]$. If both features are at least ordinally scaled, the standardized measures of association take values in the interval $[-1, 1]$ (case 1) or $[0, 1]$ (case 2). In the first case, a direction is given in addition to the strength of the relationship.

The second case also includes the error reduction measures. These assume that a predicted value can be calculated for the dependent variable: once with knowledge of the relationship (depending on the value or category of the independent variable, a certain value or category of the dependent variable is predicted) and once without knowledge of the relationship (based only on the values or categories of the dependent variable). The reduction in prediction error between the two methods is then considered, which indirectly quantifies the relationship between the variables. This also leads to asymmetrical measures, depending on which of the two variables is treated as the dependent variable. Asymmetrical here means that the value of the coefficient changes if one considers the observation series $(y_i, x_i)$ instead of the observation series $(x_i, y_i)$.
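
The error reduction idea can be sketched with Goodman and Kruskal's Lambda for a contingency table. This is a minimal illustration, not a reference implementation: the function names and the example table are made up for this sketch, and the column variable is treated as the dependent one.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable.

    `table` is a list of rows of joint frequencies (hypothetical input format).
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    # Without knowledge of the relationship: always predict the modal column.
    errors_without = n - max(col_totals)
    # With knowledge: within each row, predict that row's modal column.
    errors_with = n - sum(max(row) for row in table)
    if errors_without == 0:
        raise ValueError("dependent variable has only one occupied category")
    return (errors_without - errors_with) / errors_without


def transpose(table):
    """Swap the roles of row and column variable."""
    return [list(col) for col in zip(*table)]


# Made-up example table (rows: X categories, columns: Y categories).
table = [[30, 10, 0],
         [10, 30, 20]]
lambda_y_given_x = goodman_kruskal_lambda(table)             # Y is dependent
lambda_x_given_y = goodman_kruskal_lambda(transpose(table))  # X is dependent
```

Swapping the roles of the two variables gives a different value for this table, which is the asymmetry described above.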

Coefficients

For two nominal variables

The coefficients for two nominally scaled variables are based on a contingency table of the joint frequencies (or joint probabilities for random variables). To measure the association directly, the quadratic contingency is used, which compares the observed joint frequencies with the joint frequencies expected under independence (= no association). If the two frequencies differ for one or more combinations of categories, there is an association. There are also special coefficients for 2x2 contingency tables. Measures of association for nominal variables can also be used for ordinal or discrete metric features. However, some of the information in the data, e.g. the ordering of the categories, is then not used.

Coefficient	Range of values	Comment
Quadratic contingency	greater than or equal to zero	non-standardized, symmetrical
Mean square contingency	greater than or equal to zero	standardized for 2x2 contingency tables, symmetrical
Contingency coefficient	greater than or equal to zero and less than one	non-standardized, symmetrical
Corrected contingency coefficient	in the interval $[0, 1]$	standardized, symmetrical
Cramér's V	in the interval (?) $[0, 1]$	standardized, symmetrical
Phi coefficient	in the interval (?) $[0, 1]$	standardized, symmetrical, special case of Cramér's V for 2x2 contingency tables
Odds ratio	greater than or equal to zero	non-standardized, asymmetrical, mostly for 2x2 contingency tables
Goodman and Kruskal's Lambda	in the interval $[0, 1]$	standardized, symmetrical and asymmetrical variants, error reduction measure
Goodman and Kruskal's Tau	in the interval $[0, 1]$	standardized, symmetrical and asymmetrical variants, error reduction measure
Uncertainty coefficient	in the interval $[0, 1]$	standardized, symmetrical and asymmetrical variants, error reduction measure
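
The contingency-based coefficients in the table can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the dictionary keys and the example table are invented here, every row and column total is assumed to be positive, and the table is assumed to have at least two rows and two columns.

```python
import math


def contingency_measures(table):
    """Chi-square based measures for a table of joint frequencies.

    Assumes all row and column totals are positive and min(rows, cols) >= 2.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Quadratic contingency: observed vs. expected frequencies under independence.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    c = math.sqrt(chi2 / (chi2 + n))
    return {
        "quadratic_contingency": chi2,
        "mean_square_contingency": chi2 / n,
        "contingency_coefficient": c,
        "corrected_contingency_coefficient": c * math.sqrt(k / (k - 1)),
        "cramers_v": math.sqrt(chi2 / (n * (k - 1))),
    }


# Made-up 2x2 example.
measures = contingency_measures([[20, 10],
                                 [10, 20]])
```

For a table whose joint frequencies equal the expected ones, the quadratic contingency is zero and all derived coefficients are zero as well.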

For two ordinal variables

For coefficients for two ordinally scaled variables, one counts the pairs of observations $(x_i, y_i), (x_j, y_j)$ that are concordant ($x_i < x_j$ and $y_i < y_j$) or discordant ($x_i < x_j$ and $y_i > y_j$). Concordant pairs suggest a positive relationship, i.e. small values of $X$ occur with small values of $Y$ and large values of $X$ with large values of $Y$ in the observations. Discordant pairs suggest a negative relationship, i.e. small values of $X$ occur with large values of $Y$ and large values of $X$ with small values of $Y$. A measure of association is then calculated from the numbers of concordant and discordant pairs. The individual coefficients differ in how ties, i.e. pairs of observations with $x_i = x_j$ and/or $y_i = y_j$, are taken into account.
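
Counting concordant and discordant pairs can be sketched as below. The function names and the data are illustrative assumptions, and the ordinal categories are assumed to be coded as numbers that preserve their order.

```python
from itertools import combinations


def count_pairs(xs, ys):
    """Return (concordant, discordant, tied) counts over all observation pairs."""
    concordant = discordant = tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
        else:
            tied += 1         # x_i = x_j and/or y_i = y_j
    return concordant, discordant, tied


def kendall_tau_a(xs, ys):
    """(C - D) divided by the number of all pairs n(n-1)/2."""
    c, d, t = count_pairs(xs, ys)
    return (c - d) / (c + d + t)


def goodman_kruskal_gamma(xs, ys):
    """(C - D) divided by the number of untied pairs only."""
    c, d, _ = count_pairs(xs, ys)
    return (c - d) / (c + d)


# Made-up ordinal codes with one tie in x.
xs = [1, 2, 3, 4, 4]
ys = [1, 3, 2, 4, 5]
pairs = count_pairs(xs, ys)
tau_a = kendall_tau_a(xs, ys)
gamma = goodman_kruskal_gamma(xs, ys)
```

Because gamma drops tied pairs from the denominator, it is at least as large in absolute value as Tau a on the same data.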

An alternative is to use ranks. Each observation value $x_i$ is assigned a rank that indicates its position in the sorted series of $X$ values. The same is done with the $Y$ values. Then, for each observation, the rank of $x_i$ is compared with the rank of $y_i$. The more the ranks agree within an observation, the more this indicates a positive relationship. The more the ranks differ within an observation, the more this indicates a negative relationship.
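
A minimal sketch of the rank assignment, assuming that tied values share the mean of the positions they occupy (one common convention, not stated in the text). The helper name and the data are hypothetical.

```python
def average_ranks(values):
    """Rank 1 is the smallest value; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


# Made-up observation series.
xs = [10, 20, 30, 40, 50]
ys = [1.2, 0.8, 3.5, 2.9, 7.0]
rank_x = average_ranks(xs)
rank_y = average_ranks(ys)
# Small differences mean the ranks largely agree within each observation.
rank_differences = [rx - ry for rx, ry in zip(rank_x, rank_y)]
```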

Measures of association for ordinal variables can also be used for metric features. In this case, too, some of the information in the data is not used; on the other hand, these coefficients are then robust against outliers and also indicate non-linear relationships.

Coefficient	Range of values	Comment
Covariance for ranks	in the interval $\left[-\tfrac{n(n-1)}{2}, +\tfrac{n(n-1)}{2}\right]$	non-standardized, symmetrical, difference of concordant and discordant pairs
Kendall's Tau a	in the interval $[-1, +1]$	standardized, symmetrical, does not consider ties
Kendall's Tau b	in the interval $[-1, +1]$	standardized, symmetrical, does not consider pairs of observations with $x_i = x_j$ and $y_i = y_j$, does not reach the values $-1$ and $+1$ on non-square tables
Kendall's Tau c	in the interval $[-1, +1]$	standardized, symmetrical, does not consider ties, but corrects for non-square tables
Kendall's Tau	in the interval $[-1, +1]$	standardized, symmetrical, does not consider pairs of observations with $x_i = x_j$ and $y_i = y_j$
Goodman and Kruskal's gamma	in the interval $[-1, +1]$	standardized, symmetrical, gives values that are too high when there are ties, the absolute value is an error reduction measure
Yule's Q	in the interval $[-1, +1]$	standardized, symmetrical, special case of Goodman and Kruskal's gamma for dichotomous variables, can also be used for nominal variables
Spearman's rank correlation coefficient	in the interval $[-1, +1]$	standardized, symmetrical, implicitly requires that adjacent ranks always have the same distance

For two metric variables

Construction of the covariance:

$s_{xy} := \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

In the case of coefficients for two metrically scaled variables, the distance from $x_i$ to an average of the $X$ values and the distance from $y_i$ to an average of the $Y$ values are determined for each observation. Then the product of the two distances is calculated for each observation and averaged over all observations. Positive values of the product indicate a positive relationship, negative values a negative relationship. The figure illustrates this for the covariance of an observation series: for each observation, the distances to the means are determined, multiplied, and then averaged. The coefficients differ in how the distance is calculated and which average is used (arithmetic mean or median).
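
The construction can be sketched directly from this description, using the arithmetic mean and the divisor $n$. The function names and the data are illustrative assumptions; the standardization shown is the usual Bravais-Pearson form, and both series are assumed to be non-constant.

```python
import math


def covariance(xs, ys):
    """Average product of the distances to the arithmetic means (divisor n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Made-up data; ys_scaled measures the same quantity in a different unit.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
ys_scaled = [10 * y for y in ys]

cov_xy = covariance(xs, ys)
cov_scaled = covariance(xs, ys_scaled)
r_xy = pearson(xs, ys)
r_scaled = pearson(xs, ys_scaled)
```

Rescaling one variable changes the covariance but not the correlation, which is the practical difference between a non-standardized and a standardized measure.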

Spearman's rank correlation coefficient also follows this scheme: instead of $x_i$ and $y_i$, the ranks of $x_i$ and $y_i$ are used in the Bravais-Pearson correlation. Using properties of the ranks, e.g. $\textstyle \sum_{i=1}^{n} \operatorname{Rank}(x_i) = \tfrac{n(n+1)}{2}$, the Bravais-Pearson correlation formula can be simplified.
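
As a sketch of where this simplification leads, assume there are no ties, so that each variable's ranks are exactly $1, \dots, n$. The symbols $r_s$ and $d_i$ are notation introduced here, not taken from the text. Under that assumption both rank means and both rank variances are known in advance, and the correlation of the ranks reduces to the commonly quoted form:

```latex
\overline{\operatorname{Rank}(x)} = \overline{\operatorname{Rank}(y)} = \frac{n+1}{2},
\qquad
s^2_{\operatorname{Rank}(x)} = s^2_{\operatorname{Rank}(y)} = \frac{n^2 - 1}{12}

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)},
\qquad
d_i = \operatorname{Rank}(x_i) - \operatorname{Rank}(y_i)
```

With ties this shortcut is no longer exact, and the Bravais-Pearson formula applied to the ranks should be used instead.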

Coefficient	Range of values	Comment
Covariance	in the interval $(-\infty, +\infty)$	non-standardized, symmetrical, not robust, only measures the linear relationship
Bravais-Pearson correlation	in the interval $[-1, +1]$	standardized, symmetrical, not robust, only measures the linear relationship
Quadrant correlation	in the interval $[-1, +1]$	standardized, symmetrical, robust, also measures non-linear relationships
Coefficient of determination	in the interval $[0, +1]$	standardized, symmetrical, not robust, error reduction measure

For two variables of different scale levels

One frequently used option is to apply a coefficient that is suitable for two variables of the lower scale level. If, for example, one variable is ordinally scaled and the other metrically scaled, a coefficient for two ordinal variables is used. One accepts that not all information in the observations is used.

This becomes very problematic when one variable is metric (continuous) and the other is nominal. Therefore a number of special coefficients have been developed for different scale levels. The roles of the variables in the formulas cannot be swapped, i.e. it makes no sense to speak of symmetric or asymmetric coefficients.

Coefficient	$X$	$Y$	Range of values	Comment
Eta squared	nominal	metric	in the interval $[0, +1]$	error reduction measure, not robust
Point-biserial correlation	dichotomous	metric	in the interval $[0, +1]$	not robust
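
Eta squared can be read as an error reduction measure: the squared error when every metric value is predicted by the overall mean is compared with the squared error when it is predicted by the mean of its nominal category. The sketch below follows that reading; the function name, the input format and the data are assumptions made for illustration, and the metric values are assumed not to be all identical.

```python
def eta_squared(groups):
    """`groups` maps each nominal category to its list of metric values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    # Prediction error without knowledge of the category.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Prediction error when the category mean is used instead.
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((v - group_mean) ** 2 for v in values)
    return (ss_total - ss_within) / ss_total


# Made-up data: two categories with clearly different means.
eta2 = eta_squared({"a": [1, 2, 3], "b": [4, 5, 6]})
# Made-up data: identical category means, so the category does not help.
eta2_none = eta_squared({"a": [1, 3], "b": [2, 2]})
```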