Contingency coefficient

The contingency coefficient $C$ (after Karl Pearson) is a statistical measure of association. Pearson's contingency coefficient expresses the strength of the relationship between two (or more) nominal or ordinal variables. It is based on comparing the observed joint frequencies of two characteristics with the frequencies that would be expected if the characteristics were independent.

Quadratic contingency

The quadratic contingency, or chi-square coefficient $\chi^2$, on which the contingency coefficient is based, is a measure of the relationship between the characteristics under consideration:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left(n_{ij} - \frac{n_{i\cdot}\, n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot}\, n_{\cdot j}}{n}}$$

The informative value of the $\chi^2$ coefficient is low because its upper limit, i.e. the value it takes when the characteristics are completely dependent, depends on the size (dimension) of the contingency table (that is, on the number of categories of the variables) and on the size $n$ of the population examined. Values of the $\chi^2$ coefficient are therefore not comparable across contingency tables of different dimensions or across different sample sizes. If the characteristics are completely independent, $\chi^2 = 0$.

The following applies:

$$0 \leq \chi^2 \leq n \cdot (k-1),$$

where $k = \min(I, J)$ denotes the minimum of the number of rows $I$ and the number of columns $J$ in the contingency table.
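The two extremes of this range can be checked numerically. The following Python sketch is only an illustration: the helper name `chi_square` and both example tables are assumptions introduced here, and the code assumes that every row and column total is positive.

```python
def chi_square(table):
    """Chi-square of a table of counts (all marginal totals must be > 0)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


independent = [[10, 20], [30, 60]]  # rows are proportional: no association
dependent = [[30, 0], [0, 70]]      # the row fully determines the column

for table in (independent, dependent):
    n = sum(map(sum, table))
    k = min(len(table), len(table[0]))
    print(chi_square(table), "upper bound:", n * (k - 1))
```

For the proportional table the statistic is zero, while the fully dependent table reaches its bound $n \cdot (k-1) = 100$.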

Use

The quantity $\chi^2$ is required to determine the contingency coefficient $C$. It is also used in statistical tests (see chi-square test).

Example

The following contingency table emerged from a survey:

	Sedan	Station wagon	Total
Workers	19	18	37
Employees	43	20	63
Total	62	38	100

Calculation of the $\chi^2$ coefficient:

$$\chi^2 = \frac{\left(19 - \frac{37 \cdot 62}{100}\right)^2}{\frac{37 \cdot 62}{100}} + \frac{\left(18 - \frac{37 \cdot 38}{100}\right)^2}{\frac{37 \cdot 38}{100}} + \frac{\left(43 - \frac{63 \cdot 62}{100}\right)^2}{\frac{63 \cdot 62}{100}} + \frac{\left(20 - \frac{63 \cdot 38}{100}\right)^2}{\frac{63 \cdot 38}{100}} \approx 2.83$$
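The calculation can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation; the function name `chi_square` and the variable `survey` are chosen here for illustration, and the code assumes that no row or column total is zero.

```python
def chi_square(table):
    """Quadratic contingency (chi-square) of a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected under independence: n_i. * n_.j / n
            expected = row_totals[i] * col_totals[j] / n
            total += (observed - expected) ** 2 / expected
    return total


survey = [[19, 18],   # workers:   sedan, station wagon
          [43, 20]]   # employees: sedan, station wagon
print(round(chi_square(survey), 2))  # 2.83
```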

Mean square contingency

Another measure of the strength of the dependence between the characteristics in a contingency table is the mean square contingency, which is essentially an extension of the $\chi^2$ coefficient:

$$\frac{\chi^2}{n} = \frac{1}{n} \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left(n_{ij} - \frac{n_{i\cdot}\, n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot}\, n_{\cdot j}}{n}}$$

The larger this measure, the stronger the relationship between the two characteristics analyzed. If the two characteristics are independent, every summand becomes $0$ because the numerator of the fraction vanishes, and so does the measure itself. For a $2 \times 2$ contingency table, the measure is normalized and takes values in the interval $[0,1]$.

Contingency coefficient according to Karl Pearson

$\chi^2$ can in principle take very large values and is not restricted to the interval $[0,1]$. To eliminate the dependence of the coefficient on the sample size, the contingency coefficient $C$ (also written $CC$ or $K$) according to Karl Pearson is determined on the basis of $\chi^2$:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}},$$

with $n$ the sample size.

$C$ takes values in the interval $[0,1)$. The problem is that the upper limit of the contingency coefficient $C$ depends on the dimension of the contingency table:

It holds that $C \in \left[0, \sqrt{\frac{k-1}{k}}\right]$, where $k = \min(I, J)$ is the minimum of the number of rows $I$ and the number of columns $J$ in the contingency table.
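As an illustration, the coefficient and its table-dependent upper limit can be computed as follows. The function names `pearson_c` and `max_pearson_c` are hypothetical, and the input $\chi^2 = 2.83$ with $n = 100$ is the rounded value from the survey example.

```python
import math


def pearson_c(chi2, n):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


def max_pearson_c(rows, cols):
    """Upper limit sqrt((k - 1) / k) with k = min(rows, cols)."""
    k = min(rows, cols)
    return math.sqrt((k - 1) / k)


print(round(pearson_c(2.83, 100), 3))  # 0.166
print(round(max_pearson_c(2, 2), 3))   # 0.707
print(round(max_pearson_c(5, 4), 3))   # 0.866
```

The limit grows with the table dimension, which is why raw values of $C$ from tables of different sizes should not be compared directly.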

Corrected contingency coefficient

To eliminate, in addition to the influence of the sample size, the influence of the dimension of the contingency table (the number of categories) on the upper limit of the coefficient, and thus to make results comparable, the corrected contingency coefficient $C_{\mathrm{korr}}$ (often also written $K^{*}$) is used to measure the relationship:

$$C_{\mathrm{korr}} = \sqrt{\frac{k}{k-1}} \cdot C = \sqrt{\frac{k}{k-1}} \cdot \sqrt{\frac{\chi^2}{n + \chi^2}},$$

with $k$ as above.

The following applies: $0 \leq C_{\mathrm{korr}} \leq 1$. A $C_{\mathrm{korr}}$ near $0$ indicates independent characteristics; a $C_{\mathrm{korr}}$ near $1$ indicates a high degree of dependence between the characteristics.

For the example, the corrected contingency coefficient is $C_{\mathrm{korr}} = \sqrt{\frac{2}{2-1}} \cdot 0.166 \approx 0.234$.
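The example value can be reproduced from the survey table. The sketch below is an illustration under assumptions: the function names are invented here, and the code works from the unrounded $\chi^2$, which is why it arrives at 0.234 rather than the 0.235 that results from multiplying the rounded 0.166 by $\sqrt{2}$.

```python
import math


def chi_square(table):
    """Chi-square of a table of counts (all marginal totals must be > 0)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def corrected_c(table):
    """Corrected contingency coefficient sqrt(k / (k - 1)) * C."""
    n = sum(map(sum, table))
    k = min(len(table), len(table[0]))
    chi2 = chi_square(table)
    c = math.sqrt(chi2 / (chi2 + n))
    return math.sqrt(k / (k - 1)) * c


survey = [[19, 18], [43, 20]]
print(round(corrected_c(survey), 3))  # 0.234
```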

Cramér's V

Cramér's $V$ is a contingency coefficient, more precisely a $\chi^2$-based measure of association. It is named after the Swedish mathematician and statistician Harald Cramér.

Cramér's $V$ is a symmetric, $\chi^2$-based measure of the strength of the relationship between two or more nominally scaled variables when (at least) one of the two variables has more than two categories. For a $2 \times 2$ table, Cramér's $V$ equals the absolute value of the phi coefficient.

Procedure

$$V = \sqrt{\frac{\chi^2}{n \cdot (k-1)}}$$

$n$: total number of cases (sample size)

$k = \min(I, J)$: the minimum of the number of rows $I$ and the number of columns $J$ in the contingency table

Interpretation

Cramér's $V$ lies between $0$ and $1$ for every cross table, regardless of the number of rows and columns. It can be used with cross tables of any size. Since Cramér's $V$ is never negative, no statement can be made about the direction of the relationship.
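A minimal Python sketch of the calculation follows. The function name `cramers_v` and its parameter names are assumptions made for this illustration; the first call uses the rounded $\chi^2 = 2.83$ from the survey example, the second uses the largest possible $\chi^2$ for a hypothetical $3 \times 4$ table with $n = 60$.

```python
import math


def cramers_v(chi2, n, rows, cols):
    """Cramér's V = sqrt(chi2 / (n * (k - 1))) with k = min(rows, cols)."""
    k = min(rows, cols)
    return math.sqrt(chi2 / (n * (k - 1)))


print(round(cramers_v(2.83, 100, 2, 2), 3))  # 0.168

# At the upper limit chi2 = n * (k - 1), V reaches 1 whatever the table size.
print(cramers_v(60 * (3 - 1), 60, 3, 4))     # 1.0
```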

Phi coefficient ϕ

The phi coefficient $\phi$ (also four-field correlation coefficient or four-field coefficient, also written $\widehat{r_{\phi}}$) is a measure of the strength of the relationship between two dichotomous characteristics.

Calculation

To estimate the four-field correlation between two dichotomous characteristics $A$ and $B$, one first sets up a contingency table that contains the joint frequency distribution of the characteristics.

	$A = 0$	$A = 1$	Total
$B = 0$	$a$	$b$	$a + b$
$B = 1$	$c$	$d$	$c + d$
Total	$a + c$	$b + d$	$a + b + c + d$

With the data from the table, $\phi$ can be calculated using the formula

$$\phi = \frac{a \cdot d - b \cdot c}{\sqrt{(a+b) \cdot (c+d) \cdot (a+c) \cdot (b+d)}}.$$

The formula results from the more general definition of the correlation coefficient $\rho(A, B)$ in the special case of two binary random variables $A$ and $B$.
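The formula translates directly into code. The following Python sketch is an illustration; the function name `phi_coefficient` is an assumption, and the call reuses the counts of the survey example (workers/employees by sedan/station wagon) as the cells $a$, $b$, $c$, $d$. The function assumes that no marginal total is zero.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


print(round(phi_coefficient(19, 18, 43, 20), 3))  # -0.168
```

Unlike the $\chi^2$-based coefficients, $\phi$ keeps its sign, which depends on how the categories are coded as $0$ and $1$.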

Examples

Measuring the association between:

approval or rejection of a political decision and gender,

showing or not showing a commercial and buying or not buying a product.

Applying $\phi$ to a confusion matrix with two classes.

Note

Between $\phi$ and $\chi^2$ the relationship $\chi^2 = n \cdot \phi^2$, or equivalently $\phi^2 = \frac{\chi^2}{n}$, holds, where $n$ denotes the number of observations. Thus $\phi$ is the square root (the sign does not matter) of the mean square contingency (see above).

The test statistic used is $n \cdot \phi^2$, which, under the assumption that $\phi$ equals zero, is $\chi^2$-distributed with one degree of freedom.
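A sketch of this test in Python, using only the standard library, might look as follows. The function name `phi_test` is hypothetical. The p-value step is an addition not taken from the text: it relies on the fact that a $\chi^2$ variable with one degree of freedom is the square of a standard normal variable, so that $P(X > x) = \operatorname{erfc}\left(\sqrt{x/2}\right)$.

```python
import math


def phi_test(a, b, c, d):
    """Return (n * phi^2, p-value) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    statistic = n * phi ** 2
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = phi_test(19, 18, 43, 20)
print(round(statistic, 2))  # 2.83, the chi-square value of the survey example
print(p_value)
```

For the survey data the p-value is a little above 0.09, so independence would not be rejected at the conventional 5 % level.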

Phi as a measure of effect size

If a measure of effect size based on probabilities is sought, $\phi$ can be used. In cross tables that contain probabilities rather than absolute frequencies, the value $1$ always appears in the place where the number of cases is normally found, so $\phi$ is identical to Cohen's $w$:

$$\phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{\chi^2}{1}} = \sqrt{\chi^2} = w$$

Here $\chi^2$ is calculated not from absolute frequencies but from probabilities. Also numerically identical to Cohen's $w$ is $V \cdot \sqrt{k-1}$ with $k = \min(I, J)$, when it is calculated for cross tables that contain probabilities.
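The equivalence can be checked numerically by converting the counts of the survey example into relative frequencies. The sketch below is an illustration under assumptions: `chi_square` is a helper defined here, and the same function is applied once to the probability table (where the total is 1) and once to the count table.

```python
import math


def chi_square(table):
    """Chi-square of a table of counts or probabilities (marginals > 0)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (cell - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i, row in enumerate(table)
        for j, cell in enumerate(row)
    )


counts = [[19, 18], [43, 20]]
n = sum(map(sum, counts))
probabilities = [[cell / n for cell in row] for row in counts]

w = math.sqrt(chi_square(probabilities))      # Cohen's w, total is 1
phi_abs = math.sqrt(chi_square(counts) / n)   # |phi| from absolute counts
print(round(w, 3), round(phi_abs, 3))         # 0.168 0.168
```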

Literature

J. Bortz, G. A. Lienert, K. Boehnke: Distribution-free methods in biostatistics. Springer, Berlin 1990 (Chapter 8.1, p. 326 and p. 355 ff.).
J. M. Diehl, H. U. Kohr: Descriptive Statistics. 12th edition. Klotz, Eschborn 1999, p. 161.
P. Zöfel: Statistics for psychologists. Pearson Studium, Munich 2003.
Significance test for the four-field correlation.

References

Backhaus: Multivariate Analysis Methods. 11th edition. Springer, 2006, pp. 241, 700.

W. Kohn: Statistics. Data analysis and probability theory. Springer, 2005, p. 115.

W. Kohn: Statistics. Data analysis and probability theory. Springer, 2005, p. 114.

H. Toutenburg, C. Heumann: Descriptive Statistics: An Introduction to Methods and Applications with R and SPSS. 6th edition. Springer, 2008, p. 115.

Bernd Rönz, Hans Gerhard Strohe (Ed.): Lexicon Statistics. Gabler, Wiesbaden 1994, p. 25.

J. Bortz: Statistics for human and social scientists. 6th edition. Springer, 2005, pp. 167-168.

D. Wentura: A Small Guide to Test Strength Analysis. Department of Psychology at Saarland University, 2004, p. 6 (researchgate.net).

Jacob Cohen: Statistical Power Analysis for the Behavioral Sciences. 2nd edition. Lawrence Erlbaum Associates, Hillsdale 1988, ISBN 0-8058-0283-5.