Cohen's Kappa

Cohen's kappa is a statistical measure of the interrater reliability of assessments by (usually) two assessors (raters), which Jacob Cohen proposed in 1960. The measure can also be used for intrarater reliability, in which the same observer applies the same measurement method at two different points in time. The equation for Cohen's kappa is

\kappa = \frac{p_o - p_e}{1 - p_e} ,

where p_o is the observed proportion of agreement between the two raters and p_e the proportion of agreement expected by chance. If the raters agree in all their judgments, κ equals 1. If the raters produce only as many agreements as would be expected by chance, κ takes the value 0. (Negative values, on the other hand, indicate agreement that is even lower than chance agreement.)
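
As a brief illustration with assumed values: if two raters agree in 70% of their judgments, while 50% agreement would already be expected from the marginal distributions alone, then

\kappa = \frac{0.70 - 0.50}{1 - 0.50} = 0.40 .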

Greve and Wentura (1997, p. 111) suggest that values of 0.40 to 0.60 are still acceptable, but that values below 0.40 should be viewed with skepticism. Interrater reliability values of about 0.75 and above are regarded as good to excellent.

Landis and Koch (1977) suggest: values below 0.00 = "poor agreement", 0.00-0.20 = "slight agreement", 0.21-0.40 = "fair agreement", 0.41-0.60 = "moderate agreement", 0.61-0.80 = "substantial agreement", 0.81-1.00 = "(almost) perfect agreement".

The problem with the coefficient is that its maximum value is not always one (see below).

Nominal scale, two raters

If only agreements and disagreements between the two raters are counted, all rating differences that occur carry the same weight. This is appropriate above all for nominal scales. The data (i.e. the judgment frequencies f_ij) for an item or characteristic with k (nominal) categories from both raters can be recorded in a k × k contingency table (i.e. with k rows and k columns):

                           Rater B
                  1      2     ...     k     Marginal frequencies
  Rater A    1   f_11   f_12   ...   f_1k    f_1+
             2   f_21   f_22   ...   f_2k    f_2+
             .   .      .            .       .
             .   .      .            .       .
             k   f_k1   f_k2   ...   f_kk    f_k+
  Marginal
  frequencies    f_+1   f_+2   ...   f_+k    N

Then the following applies to the proportion p_o of judgments in which the raters agree (= main diagonal of the contingency table):

p_o = \frac{1}{N} \sum_{i=1}^{k} f_{ii} ,

where N corresponds to the total number of rated objects (persons / items / objects).

For the expected agreement p_e, the products of the marginal sums (= row sum × column sum) of each category are added up and then related to the square of the total sum:

p_e = \frac{1}{N^2} \sum_{i=1}^{k} f_{i+} \, f_{+i} .
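
The calculation can be sketched in a few lines of code. The following is a minimal sketch, assuming the contingency table is given as a NumPy array; the function name and the example frequencies are illustrative, not taken from the article.

    import numpy as np

    def cohens_kappa(table):
        # Cohen's kappa from a k x k contingency table of two raters' judgments.
        table = np.asarray(table, dtype=float)
        n = table.sum()                    # N: total number of rated objects
        p_o = np.trace(table) / n          # observed agreement (main diagonal)
        row = table.sum(axis=1)            # marginal frequencies of rater A
        col = table.sum(axis=0)            # marginal frequencies of rater B
        p_e = (row * col).sum() / n**2     # chance-expected agreement
        return (p_o - p_e) / (1 - p_e)

    # Example with assumed frequencies: p_o = 0.70, p_e = 0.50, kappa = 0.40
    print(cohens_kappa([[20, 5], [10, 15]]))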

Scott (1955) proposed for his coefficient π, which is calculated with the same basic formula as κ, to determine the expected agreement as follows:

p_e = \sum_{i=1}^{k} \left( \frac{f_{i+} + f_{+i}}{2N} \right)^2 .

If the marginal distributions of the two raters differ, Scott's expected agreement is always larger than Cohen's p_e, so Scott's π never exceeds Cohen's κ.
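
A sketch of this variant under the same assumptions as above (NumPy table, illustrative names); applied to the assumed example table from the previous sketch it yields π ≈ 0.394, slightly below κ = 0.40.

    import numpy as np

    def scotts_pi(table):
        # Scott's pi: same basic formula as kappa, but p_e uses pooled marginals.
        table = np.asarray(table, dtype=float)
        n = table.sum()
        p_o = np.trace(table) / n
        pooled = (table.sum(axis=1) + table.sum(axis=0)) / (2 * n)  # averaged marginal proportions
        p_e = (pooled ** 2).sum()
        return (p_o - p_e) / (1 - p_e)

    print(scotts_pi([[20, 5], [10, 15]]))   # approx. 0.394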

As soon as a cell off the main diagonal of the contingency table is occupied (i.e. rating differences occur), the maximum value of Cohen's kappa depends on the marginal distributions. It becomes smaller the further the marginal distributions depart from a uniform distribution. Brennan and Prediger (1981) therefore propose a corrected kappa value κ_n, in which the chance-expected agreement is set to 1/k, where, as above, k is the number of categories (i.e. the possible characteristic values). Thus:

\kappa_n = \frac{p_o - \tfrac{1}{k}}{1 - \tfrac{1}{k}} .
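
A brief numerical illustration with assumed values: for k = 4 categories and an observed agreement of p_o = 0.70,

\kappa_n = \frac{0.70 - 0.25}{1 - 0.25} = 0.60 .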

Fleiss' kappa

Extending the formulas to more than two raters is in principle unproblematic. This extension of the statistic is also known as Fleiss' kappa. For the proportion of agreements that occur, e.g. for three raters,

p_o = \frac{1}{N} \sum_{i=1}^{k} f_{iii}

and

p_e = \sum_{i=1}^{k} \frac{f_{i++}}{N} \cdot \frac{f_{+i+}}{N} \cdot \frac{f_{++i}}{N} ,

where f_{iii} are the cells of the main diagonal of the three-dimensional contingency table and f_{i++}, f_{+i+}, f_{++i} are the three raters' marginal frequencies.
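
A sketch of the three-rater case under the formulas reconstructed above, assuming the judgments are collected in a k × k × k NumPy frequency array (one axis per rater); the function name is illustrative.

    import numpy as np

    def kappa_three_raters(table):
        # Agreement coefficient for three raters from a k x k x k frequency table.
        t = np.asarray(table, dtype=float)
        n = t.sum()                                   # N: total number of rated objects
        k = t.shape[0]
        p_o = sum(t[i, i, i] for i in range(k)) / n   # triple-diagonal cells
        m1 = t.sum(axis=(1, 2)) / n                   # marginal proportions, rater 1
        m2 = t.sum(axis=(0, 2)) / n                   # marginal proportions, rater 2
        m3 = t.sum(axis=(0, 1)) / n                   # marginal proportions, rater 3
        p_e = (m1 * m2 * m3).sum()
        return (p_o - p_e) / (1 - p_e)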

For the coefficient of Brennan and Prediger (1981), von Eye (2006, p. 15) suggests the following extension to m raters:

\kappa_n = \frac{\sum_j p_j - k^{1-m}}{1 - k^{1-m}} ,

where j is an index for the agreement cells (the diagonal) of the m-dimensional contingency table.
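
For illustration with assumed numbers: with k = 3 categories and m = 3 raters, the chance term is k^{1-m} = 3^{-2} = 1/9, so an observed agreement of \sum_j p_j = 0.50 would give

\kappa_n = \frac{0.50 - 1/9}{1 - 1/9} \approx 0.44 .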

If, as above, k is the number of categories, m the number of raters (= the number of assessments per characteristic / item / person), and N the total number of objects to be assessed (cases / persons / items / objects), the following applies:

  • n_{ij} is the number of raters who assigned the i-th object to category j.
  • \sum_{i=1}^{N} n_{ij} is the sum of all assessments falling into category j (the column total).
  • p_j = \frac{1}{N m} \sum_{i=1}^{N} n_{ij} is the proportion of all assessments falling into category j out of all N · m assessments.

The degree of rater agreement for the i-th case (i.e. for the i-th person / item / object) is then calculated as

P_i = \frac{1}{m(m-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - m \right) .

The mean value \bar{P} over all P_i and the chance-expected value \bar{P}_e enter the formula for κ:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad \bar{P}_e = \sum_{j=1}^{k} p_j^2 .
                Category
  Case       1      2      3      4      5      P_i
    1        0      0      0      0     14     1.000
    2        0      2      6      4      2     0.253
    3        0      0      3      5      6     0.308
    4        0      3      9      2      0     0.440
    5        2      2      8      1      1     0.330
    6        7      7      0      0      0     0.462
    7        3      2      6      3      0     0.242
    8        2      5      3      2      2     0.176
    9        6      5      2      1      0     0.286
   10        0      2      2      3      7     0.286
  Total     20     28     39     21     32
  p_j     0.143  0.200  0.279  0.150  0.229

  Example table for calculating Fleiss' kappa

Example

In the following calculation example, m = 14 raters assess N = 10 cases on a scale with k = 5 categories.

The categories are in the columns, the cases in the rows. The sum of all ratings is N · m = 10 · 14 = 140.

For example, for the first column

p_1 = \frac{20}{140} = 0.143

and for the second row

P_2 = \frac{1}{14 \cdot 13} \left( 0^2 + 2^2 + 6^2 + 4^2 + 2^2 - 14 \right) = \frac{46}{182} = 0.253 .

This results in

\bar{P} = \frac{1}{10} \left( 1.000 + 0.253 + 0.308 + \dots + 0.286 \right) = 0.378

and

\bar{P}_e = 0.143^2 + 0.200^2 + 0.279^2 + 0.150^2 + 0.229^2 = 0.213 ,

so that

\kappa = \frac{0.378 - 0.213}{1 - 0.213} = 0.210 .

(That κ turns out so similar to \bar{P}_e here is a coincidence.)
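
The example can be checked with a few lines of code. This is a minimal sketch, assuming the rating counts are given as an N × k NumPy array exactly as in the table above; the function name is illustrative.

    import numpy as np

    def fleiss_kappa(counts):
        # counts[i, j]: number of raters who assigned case i to category j.
        counts = np.asarray(counts, dtype=float)
        N, k = counts.shape                      # cases, categories
        m = counts[0].sum()                      # raters per case (assumed constant)
        p_j = counts.sum(axis=0) / (N * m)       # proportion of assessments per category
        P_i = (np.square(counts).sum(axis=1) - m) / (m * (m - 1))  # per-case agreement
        P_bar, P_e = P_i.mean(), np.square(p_j).sum()
        return (P_bar - P_e) / (1 - P_e)

    ratings = [[0, 0, 0, 0, 14], [0, 2, 6, 4, 2], [0, 0, 3, 5, 6], [0, 3, 9, 2, 0],
               [2, 2, 8, 1, 1], [7, 7, 0, 0, 0], [3, 2, 6, 3, 0], [2, 5, 3, 2, 2],
               [6, 5, 2, 1, 0], [0, 2, 2, 3, 7]]
    print(round(fleiss_kappa(ratings), 3))       # 0.21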

Multiple gradations of the objects to be rated, two raters

If the raters are asked to classify the objects of assessment into several grades (i.e. instead of k nominal categories there are now k ordered gradations, for which at least an ordinal scale level can be assumed), larger discordant deviations between the raters should carry more weight than smaller deviations. In this case a weighted kappa should be calculated, in which a weighting factor w_ij is defined for each cell ij of the contingency table, which could, for example, be based on how far the cell lies from the main diagonal (e.g. as squared deviations: cells on the main diagonal = 0, deviations by 1 category = 1, deviations by 2 categories = 2² = 4, etc.). The following then applies for this (weighted) kappa (cf. Bortz 1999):

\kappa_w = 1 - \frac{\sum_{i,j} w_{ij} \, f_{ij}}{\sum_{i,j} w_{ij} \, e_{ij}} ,

where e_{ij} = f_{i+} f_{+j} / N are the cell frequencies expected by chance.
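
A minimal sketch of this weighting scheme, assuming quadratic disagreement weights and the same NumPy table representation as above; names are illustrative.

    import numpy as np

    def weighted_kappa(table):
        # Weighted kappa with quadratic disagreement weights (0 on the diagonal).
        table = np.asarray(table, dtype=float)
        n = table.sum()
        k = table.shape[0]
        i, j = np.indices((k, k))
        w = (i - j) ** 2                                                # squared distance from the diagonal
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n   # chance-expected cell frequencies
        return 1 - (w * table).sum() / (w * expected).sum()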

Alternatives to this coefficient are the Spearman rank correlation coefficient, Kendall's rank correlation coefficient (Kendall's tau), and Kendall's coefficient of concordance W.

Cardinal scale kappa

This idea of weighting can be taken further: at the interval scale level, the size of the difference (or similarity) between the given assessments can even be quantified directly (Cohen 1968, 1972). The weighting values for each cell of the contingency table are then based on the maximum and minimum difference.

For the cardinal scale kappa, identical assessments (or the minimum observer difference d_min) are to be given the standardized weight 0 and the maximum observer difference d_max the weight 1, with the other observed differences d_ij each weighted in proportion to these:

w_{ij} = \frac{|d_{ij}| - d_{\min}}{d_{\max} - d_{\min}} ,

and for the [0,1] standardization of the weights:

0 \le w_{ij} \le 1 .
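
A brief sketch of this normalization, assuming the k categories carry numeric scale values; the helper name and the example values are illustrative. Such a weight matrix could then replace the quadratic weights in the weighted kappa sketch above.

    import numpy as np

    def cardinal_weights(scores):
        # Standardized [0, 1] disagreement weights from the categories' scale values.
        scores = np.asarray(scores, dtype=float)
        d = np.abs(scores[:, None] - scores[None, :])     # pairwise differences |d_ij|
        return (d - d.min()) / (d.max() - d.min())        # 0 for identical, 1 for maximal difference

    print(cardinal_weights([1, 2, 4, 8]))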

The weighted kappa is a special case of the intra-class correlation coefficient (Fleiss & Cohen 1973).

Individual evidence

  1. Kilem Li Gwet: Intrarater Reliability. In: Wiley Encyclopedia of Clinical Trials. John Wiley & Sons, 2008 (agreestat.com [PDF]).

Literature and Sources

  • J. Bortz: Statistics for social scientists. 5th edition. Springer, Berlin 1999.
  • J. Bortz, GA Lienert, K. Boehnke: Distribution-free methods in biostatistics. Chapter 9. Springer, Berlin 1990.
  • RL Brennan, DJ Prediger: Coefficient kappa: Some uses, misuses, and alternatives. In: Educational and Psychological Measurement. 41, 1981, pp. 687-699.
  • J. Cohen: A coefficient of agreement for nominal scales. In: Educational and Psychological Measurement. 20, 1960, pp. 37-46.
  • J. Cohen: Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. In: Psychological Bulletin. 1968, pp. 213-220.
  • J. Cohen: Weighted chi square: An extension of the kappa method. In: Educational and Psychological Measurement. 32, 1972, pp. 61-74.
  • JL Fleiss: The measurement of interrater agreement. In: JL Fleiss: Statistical methods for rates and proportions. 2nd edition. John Wiley & Sons, New York 1981, chapter 13, pp. 212-236.
  • JL Fleiss, J. Cohen: The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. In: Educational and Psychological Measurement. 33, 1973, pp. 613-619.
  • W. Greve, D. Wentura: Scientific Observation: An Introduction. PVU / Beltz, Weinheim 1997.
  • JR Landis, GG Koch: The measurement of observer agreement for categorical data. In: Biometrics. 33, 1977, pp. 159-174.
  • WA Scott: Reliability of content analysis: The case of nominal scale coding. In: Public Opinion Quarterly. 19, 1955, pp. 321-325.
  • A. von Eye: An Alternative to Cohen's κ. In: European Psychologist. 11, 2006, pp. 12-24.