Contingency table

Contingency tables (also called cross tables or crosstabs) are tables that contain the absolute or relative frequencies of combinations of certain characteristic values; they are a form of frequency table. Contingency here means the joint occurrence of two characteristics. The table therefore shows frequencies for several characteristics that are linked by a logical "and" (conjunction). These frequencies are supplemented by their row and column sums, which form the so-called marginal frequencies. A frequent special case of a contingency table with two characteristics is the confusion matrix.

Structure and application

In contrast to an ordinary ("flat") table, which has the attribute names in the first row and the attribute values in all other rows, a cross table has characteristic values as both row and column headings. The cell at the intersection of a row and a column shows a value that depends on the characteristic values given for that row and that column.

$X$ \ $Y$	$y_1$	$y_2$	$\ldots$	$y_K$	Marginal frequency of $X$
$x_1$	$h_{11}$	$h_{12}$	$\ldots$	$h_{1K}$	$h_{1\bullet}$
$x_2$	$h_{21}$	$h_{22}$	$\ldots$	$h_{2K}$	$h_{2\bullet}$
$\vdots$	$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$
$x_J$	$h_{J1}$	$h_{J2}$	$\ldots$	$h_{JK}$	$h_{J\bullet}$
Marginal frequency of $Y$	$h_{\bullet 1}$	$h_{\bullet 2}$	$\ldots$	$h_{\bullet K}$	$h_{\bullet\bullet}$

A general crosstab for two variables $X$ and $Y$ is shown above. The characteristic values $x_1, \dotsc, x_J$ of the variable $X$ are listed on the left, and the characteristic values $y_1, \dotsc, y_K$ of the variable $Y$ across the top. The numbers of characteristic values $J$ and $K$ can differ between the two variables. If they are the same, one speaks of a square cross table.

The table shows the absolute frequencies $h_{jk}$, i.e. the number of observations in which the characteristic values $x_j$ and $y_k$ occur together. The marginal frequencies $h_{j\bullet} = h_{j1} + \dotsb + h_{jK}$ are shown on the right and the marginal frequencies $h_{\bullet k} = h_{1k} + \dotsb + h_{Jk}$ at the bottom.

Finally, at the bottom right is the sum of the marginal frequencies

$h_{\bullet\bullet} = h_{1\bullet} + \dotsb + h_{J\bullet} = h_{\bullet 1} + \dotsb + h_{\bullet K} = n$,

where $n$ is the number of observations in the data set.

Instead of absolute frequencies, relative frequencies can also be displayed. In this case, $f$ is often used instead of $h$, and $f_{\bullet\bullet} = 1$ holds.
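
The marginal sums and the step from absolute to relative frequencies can be illustrated with a short sketch. The table values and the helper names `margins` and `relative` are made up for illustration and do not come from the text; the table is stored as a list of rows.

```python
def margins(table):
    """Return (row sums, column sums, grand total) of a J x K table of counts."""
    row_sums = [sum(row) for row in table]          # h_{j.} for each row j
    col_sums = [sum(col) for col in zip(*table)]    # h_{.k} for each column k
    total = sum(row_sums)                           # h_{..} = n
    return row_sums, col_sums, total


def relative(table):
    """Divide every cell by n so that the cells sum to 1."""
    _, _, n = margins(table)
    return [[h / n for h in row] for row in table]


# Hypothetical 2 x 3 table of absolute frequencies (J = 2, K = 3)
h = [[10, 20, 30],
     [5, 15, 20]]

row_sums, col_sums, n = margins(h)
f = relative(h)

print(row_sums, col_sums, n)
print(f)
```

Both the row sums and the column sums add up to the same grand total `n`, which is the identity stated above for the bottom-right cell.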

Four-field table

A four-field table is a special form of a two-dimensional contingency table in which each of the two variables has only two characteristic values. It is structured as follows:

Feature	$B$	$\bar{B}$	total
$A$	$h(A \cap B)$	$h(A \cap \bar{B})$	$h(A)$
$\bar{A}$	$h(\bar{A} \cap B)$	$h(\bar{A} \cap \bar{B})$	$h(\bar{A})$
total	$h(B)$	$h(\bar{B})$	$n$

Example of a two-dimensional contingency table

2000 people are asked whether they prefer product A or product B. The result is evaluated according to the gender of the respondents. This yields the following four-field table, shown here in several variants:

with absolute frequencies

Product \ gender	female	male	total
Product A	660	340	1000
Product B	340	660	1000
total	1000	1000	2000

with relative frequencies based on the number of cases

Product \ gender	female	male	total
Product A	0.33	0.17	0.5
Product B	0.17	0.33	0.5
total	0.5	0.5	1

with relative frequencies based on the columns

Product \ gender	female	male	total
Product A	0.66	0.34	0.5
Product B	0.34	0.66	0.5
total	1	1	1

with relative frequencies based on the rows

Product \ gender	female	male	total
Product A	0.66	0.34	1
Product B	0.34	0.66	1
total	0.5	0.5	1
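
The three kinds of relative frequencies differ only in the divisor: the grand total, the column sum, or the row sum. The following sketch recomputes them from the absolute frequencies of the example; the variable names are chosen for illustration only.

```python
rows = ["Product A", "Product B"]
cols = ["female", "male"]

# Absolute frequencies from the example (rows: product, columns: gender)
h = [[660, 340],
     [340, 660]]

n = sum(sum(row) for row in h)
row_sums = [sum(row) for row in h]
col_sums = [sum(col) for col in zip(*h)]

# Relative to the number of cases: every cell divided by n
by_total = [[h[j][k] / n for k in range(len(cols))] for j in range(len(rows))]

# Relative to the columns: every cell divided by its column sum
by_column = [[h[j][k] / col_sums[k] for k in range(len(cols))] for j in range(len(rows))]

# Relative to the rows: every cell divided by its row sum
by_row = [[h[j][k] / row_sums[j] for k in range(len(cols))] for j in range(len(rows))]

for name, table in [("total", by_total), ("column", by_column), ("row", by_row)]:
    print(name)
    for label, row in zip(rows, table):
        print(" ", label, [round(x, 2) for x in row])
```

Because the example table is symmetric, the column-based and row-based cell values coincide here; in general they differ.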

Appearances can be deceptive

At first glance, female customers appear to tend towards product A and male customers towards product B. This may be useful information, but it may also be a fallacy. Evaluating the same survey by the age of the customers shows:

Product \ age	up to 40 years	over 40 years	total
Product A	700	300	1000
Product B	300	700	1000
total	1000	1000	2000

Buying behavior therefore appears to depend not only on gender but also on the age of the respondents. To relate these two dependencies to one another realistically, a three-dimensional contingency table is needed.

To infer properties of the underlying populations from the relationships found in the samples examined, chi-square tests can be used (under certain conditions). Fisher's exact test is a statistical test of independence in contingency tables for small samples.
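
As an illustration of the chi-square approach, the following minimal sketch computes Pearson's chi-square statistic for the product-by-gender table of the example. It uses the standard textbook definition (observed minus expected counts under independence); the function name is hypothetical, and the sketch stops at the statistic rather than computing a p-value.

```python
def chi_square_statistic(table):
    """Pearson's chi-square statistic for a J x K table of counts.

    Assumes every row sum and column sum is non-zero, otherwise an
    expected count of zero would lead to a division by zero.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for j, row in enumerate(table):
        for k, observed in enumerate(row):
            # Expected count if the two characteristics were independent
            expected = row_sums[j] * col_sums[k] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Product (rows) by gender (columns) from the example
h = [[660, 340],
     [340, 660]]

chi2 = chi_square_statistic(h)
dof = (len(h) - 1) * (len(h[0]) - 1)   # degrees of freedom: (J - 1)(K - 1)
print(chi2, dof)
```

Every expected count in this table is 500, so the statistic is about 204.8 with one degree of freedom, far above the usual 5% critical value of roughly 3.84. The test therefore flags a dependence between product and gender in this table; it cannot tell whether that dependence is produced by a third characteristic.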

Categories to be used in contingency tables

The statistical procedures based on contingency tables, in particular, place requirements on the categories (a category being a single characteristic value or a combination of several characteristic values):

Strictly speaking, all categories must be mutually exclusive. For example, a person cannot be “female” and “male” at the same time (except in rare cases of intersex, which are neglected here). With “attended elementary school” and “completed apprenticeship”, however, the members of the latter group also belong to the former, since attendance at elementary school is compulsory for everyone (in western societies). The problem is that the marginal frequencies then no longer add up to $n$ (absolute frequencies) or to $1$ (relative frequencies).
Furthermore, the contingency table should contain no rows or columns whose frequencies add up to zero. For example, such a table cannot have the categories “male” and “female” when an exclusively male or exclusively female population is examined. The problem is that the reciprocal of this sum occurs in the statistical evaluation, and 1/0 is not defined.
In addition, an “Other” category should be used as rarely as possible, for example as in “drives Opel”, “drives Peugeot”, “drives Toyota”, “drives another passenger car”. If such a category is necessary, this catch-all should be kept as small as possible through a well-thought-out choice of categories.

Three-dimensional contingency table

For a three-dimensional table (three characteristics), the values of the third characteristic are nested as additional columns:

	Gender Female		Gender Male
Product \ age	up to 40 years	over 40 years	up to 40 years	over 40 years	total
Product A	630 (70%)	30 (30%)	70 (70%)	270 (30%)	1000
Product B	270 (30%)	70 (70%)	30 (30%)	630 (70%)	1000
total	900 (100%)	100 (100%)	100 (100%)	900 (100%)	2000

The percentages in brackets are intended only to draw attention to the fact that, within each age group, product preference does not depend on gender at all: 70% of younger women and of younger men, and 30% of older women and of older men, prefer product A; for product B it is exactly the reverse.

To make this phenomenon more plausible, it is worth looking at a further (again two-dimensional) contingency table, this time of gender against age:

Gender \ age	up to 40 years	over 40 years	total
Female	900	100	1000
Male	100	900	1000
total	1000	1000	2000

It becomes clear here that an overwhelming majority of the younger respondents, 90%, were female. It is the younger customers who prefer product A, not the female ones. Conversely, older people (mainly men in this survey) prefer product B. The gender effect in the example is only an apparent one that arose from the unbalanced sample.
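
The relationship between the three-dimensional table and the two-dimensional tables can be retraced by summing over one characteristic at a time. The sketch below stores the three-dimensional counts of the example in a dictionary keyed by (product, gender, age); the helper `collapse` and the key layout are illustrative choices, not part of the original text.

```python
from collections import defaultdict

# Three-dimensional counts from the example: (product, gender, age) -> frequency
counts = {
    ("A", "female", "up to 40"): 630,
    ("A", "female", "over 40"): 30,
    ("A", "male", "up to 40"): 70,
    ("A", "male", "over 40"): 270,
    ("B", "female", "up to 40"): 270,
    ("B", "female", "over 40"): 70,
    ("B", "male", "up to 40"): 30,
    ("B", "male", "over 40"): 630,
}


def collapse(table, keep):
    """Sum a multi-way table over all axes except those listed in keep."""
    out = defaultdict(int)
    for key, h in table.items():
        out[tuple(key[i] for i in keep)] += h
    return dict(out)


product_by_gender = collapse(counts, keep=(0, 1))   # sums over age
product_by_age = collapse(counts, keep=(0, 2))      # sums over gender
gender_by_age = collapse(counts, keep=(1, 2))       # sums over product

# Share of product A within each (gender, age) stratum
share_a = {
    stratum: counts[("A",) + stratum] / total
    for stratum, total in gender_by_age.items()
}

print(product_by_gender)
print(product_by_age)
print(gender_by_age)
print(share_a)
```

Within each age group the share of product A is the same for women and men (0.7 for the younger, 0.3 for the older respondents), while the collapsed product-by-gender table still shows 660 against 340.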

Graphical representation

3D bar charts are an obvious choice for the graphic representation of two-dimensional contingency tables. A disadvantage of such diagrams, however, is that bars can be hidden behind others depending on the viewing angle. In addition, the 3D display introduces perspective distortion that can make it difficult for the viewer to compare the heights of the bars and see which cell contains more observations.

Another option, particularly useful for contingency tables with relatively few cells, is a stacked bar chart based on the relative column frequencies.

It is better to use a mosaic plot, in which the areas correspond to the frequencies of each combination of characteristic values. A mosaic plot also makes it easy to see whether two or more variables are independent.

3D bar chart of the results of the parliamentary elections in Ukraine on September 30, 2007, broken down by regions and parties.
Stacked column chart referring to relative column frequencies (fictitious data).
Mosaic plot of the frequencies of passengers on the Titanic according to the variables class (1st class, 2nd class, 3rd class, crew), gender (male, female) and survived (yes, no).

Statistical evaluation

As contingency tables become more complex, relationships can no longer be read off by eye. Statistics offers a number of methods for systematic analysis:

Measures of association:
- Contingency coefficients: $\chi^2$ coefficient, (corrected) contingency coefficient, Cramér's V and phi coefficient
- Proportional reduction in error measures: Goodman and Kruskal's λ and τ as well as the uncertainty coefficient
Tests:
- Chi-square tests and Fisher's exact test (see above)
Further analysis methods:
- Log-linear model
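
Several of the contingency coefficients listed above are simple functions of the chi-square statistic. The sketch below applies the usual textbook definitions to the product-by-gender table of the example; these formulas are assumed here and not taken from the text, and the function name is hypothetical. The phi coefficient in this form is normally used for four-field tables only.

```python
import math


def association_measures(table):
    """Chi-square based measures of association for a J x K table of counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    # Pearson's chi-square statistic (all marginal sums must be non-zero)
    chi2 = sum(
        (table[j][k] - row_sums[j] * col_sums[k] / n) ** 2
        / (row_sums[j] * col_sums[k] / n)
        for j in range(len(table))
        for k in range(len(table[0]))
    )

    m = min(len(table), len(table[0]))          # smaller table dimension
    phi = math.sqrt(chi2 / n)                   # phi coefficient
    c = math.sqrt(chi2 / (chi2 + n))            # contingency coefficient
    c_corr = c * math.sqrt(m / (m - 1))         # corrected: C divided by its maximum
    v = math.sqrt(chi2 / (n * (m - 1)))         # Cramér's V
    return {"chi2": chi2, "phi": phi, "C": c, "C_corr": c_corr, "V": v}


measures = association_measures([[660, 340],
                                 [340, 660]])
for name, value in measures.items():
    print(name, round(value, 4))
```

For a four-field table, Cramér's V and the phi coefficient coincide; here both come out at about 0.32, a moderate association between product and gender in the collapsed table.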
