Interrater reliability

In empirical research (including psychology, sociology, and epidemiology), interrater reliability, or interrater agreement, denotes the extent of agreement (= concordance) between the ratings given by different observers ("raters"). It indicates how far the results are independent of the observer, which is why, strictly speaking, it is a measure of objectivity. Reliability is a measure of the quality of the method used to measure a particular variable. A distinction can be made between interrater and intra-rater reliability.

Interrater reliability

The same measurement is carried out on a specific object by two different raters. The results should be the same. An example: two people (persons A and B) had a conversation. Two judges (raters 1 and 2) observed them and estimated how long each person spoke. The estimates were recorded on a rating scale: extremely short (−3), very short (−2), short (−1), medium (0), long (+1), very long (+2), extremely long (+3). Rater 1 rated the speaking time of person A as −3 and that of person B as +3. Rater 2 gave −2 for person A and +2 for person B.

           Person A   Person B
Rater 1       −3         +3
Rater 2       −2         +2

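In this case the interrater reliability can be called fairly good: both raters agree on which person spoke longer, and their ratings differ by only one scale point per person.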

Following this principle, similar checks can be carried out with more raters and more objects to be measured.

Intra-rater reliability

The same rater takes the same measurement twice on a specific object. The results should be the same. Example: a test person is questioned twice, at different times, by the same interviewer.

Kappa statistics

There are a number of statistical methods that can be used to determine interrater reliability. If two (or more) different observers rate several observation objects (= cases, test subjects) categorically at the same time, the interrater reliability can be estimated using Cohen's kappa (for two raters) or Fleiss' kappa (for more than two raters). Kappa statistics assess the degree of concordance by comparing it with the degree of agreement that would typically be reached by "random rating". It is assumed that the individual ratings of a rater are made completely independently of one another. Kappa can take values between +1.0 (high concordance) and 0 or below (low concordance). Kappa statistics are particularly suitable for variables at the nominal scale level.
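The comparison of observed agreement with chance agreement can be sketched in a few lines. The following is a minimal illustration using the standard textbook formulas, not a reference implementation; the function names and all ratings are hypothetical and chosen only for this example.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who categorize the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(ratings_a)
    # Observed proportion of cases on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters who put case i in category j."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every case needs the same number (>= 2) of ratings")
    # Mean proportion of agreeing rater pairs per case.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(counts[0]))
    ]
    p_expected = sum(s * s for s in shares)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical ratings of 10 cases by two raters.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.583

# Hypothetical counts: 4 cases, 3 raters, 2 categories.
print(round(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]), 3))  # 0.333
```

In the two-rater example the raters agree on 8 of 10 cases (0.80), while their category frequencies alone would produce an agreement of 0.52, which gives kappa = 0.28 / 0.48.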

The use of kappa statistics has also been criticized on the grounds that, because of their mathematical inadequacy, their values mostly do not permit meaningful conclusions; Krippendorff's alpha is recommended instead.
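For orientation, Krippendorff's alpha for nominal data can be sketched as follows. This is a simplified illustration that assumes the usual coincidence-based definition (alpha = 1 − observed disagreement / expected disagreement); the function name and the binary ratings are hypothetical.

```python
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per rated unit, holding the category labels the raters
    assigned to it (missing ratings are simply left out).
    """
    # Only units with at least two ratings contribute pairable values.
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    totals = Counter()  # how often each category occurs overall
    matched = 0.0       # weighted count of agreeing value pairs
    for unit in pairable:
        counts = Counter(unit)
        totals.update(counts)
        matched += sum(c * (c - 1) for c in counts.values()) / (len(unit) - 1)
    denominator = n * n - sum(c * c for c in totals.values())
    if denominator == 0:
        raise ValueError("alpha is undefined when only one category occurs")
    return 1 - (n - 1) * (n - matched) / denominator


# Hypothetical binary ratings of 10 units by two raters.
rater_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
units = [list(pair) for pair in zip(rater_a, rater_b)]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.095
```

Because each unit is passed as a list of whatever ratings exist for it, this sketch also tolerates units rated by different numbers of raters.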

Inter-rater correlation

For higher scale levels, other methods use Pearson's product-moment correlation coefficient or the rank correlation coefficients of Spearman and Kendall to determine the inter-rater correlation between two raters, relating the paired rating values to one another. However, the inter-rater correlation coefficient only describes some kind of relationship between the two sets of measurements; deviations between the raters play no role. Constant leniency or severity tendencies, for example, are irrelevant.

Example: Rater 1 rates 4 items on a scale as follows: ; Rater 2 rates the same items on the same scale: . The inter-rater correlation is r = 1 and thus perfect, although the raters do not agree.
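The point that a perfect correlation can coexist with complete disagreement can be checked numerically. The sketch below uses hypothetical ratings, chosen here so that Rater 2 is always exactly one scale point more lenient than Rater 1; the values are an assumption for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equally long rating lists."""
    mean_x, mean_y = mean(x), mean(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)


# Hypothetical ratings: Rater 2 is constantly one point higher than Rater 1.
rater_1 = [1, 2, 1, 3]
rater_2 = [value + 1 for value in rater_1]

r = pearson_r(rater_1, rater_2)
# Share of items on which both raters gave exactly the same value.
exact_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

print(r)                # 1.0
print(exact_agreement)  # 0.0
```

The constant offset leaves the correlation at 1 even though the raters never give the same value, which is why a correlation coefficient alone does not establish agreement.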

An alternative for ordinally scaled data is Kendall's concordance coefficient W, which is used to calculate the degree of agreement for two or more assessors.
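As a rough sketch, Kendall's W can be computed from the rank sums of the rated items. The version below assumes complete rankings without ties (so no tie correction is applied); the function name and the rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W (no tie correction).

    rankings: one list per rater, giving the rank (1..n) that rater
    assigned to each of the n items.
    """
    m = len(rankings)     # number of raters
    n = len(rankings[0])  # number of items
    if m < 2 or n < 2 or any(len(r) != n for r in rankings):
        raise ValueError("need at least 2 raters ranking the same >= 2 items")
    # Sum of the ranks each item received across all raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = sum(rank_sums) / n
    # Squared deviations of the rank sums from their mean.
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of 4 items by 3 raters.
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
print(round(kendalls_w(rankings), 3))  # 0.778
```

W is 1 when all raters produce the same ranking and approaches 0 when the rank sums of the items are nearly equal.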

Intra-class correlation

The intra-class correlation coefficient (ICC; Shrout & Fleiss 1979, McGraw & Wong 1996) expresses the requirement that the two measured values have the same value, so it reflects agreement between the measurements and not merely their association. It assumes interval-scaled data and is usually calculated when there are more than two observers and/or when two or more observation times are to be included.
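To illustrate how an ICC penalizes systematic differences between raters, the following sketch computes two single-rater coefficients from the mean squares of a two-way layout, using the labels ICC(2,1) (absolute agreement) and ICC(3,1) (consistency) from Shrout & Fleiss. The choice of these two variants, the function name, and the data are assumptions made for this example.

```python
def icc_single_rater(table):
    """Return (ICC(2,1), ICC(3,1)) for a complete subjects-by-raters table.

    table[i][j] = rating of subject i by rater j.
    """
    n, k = len(table), len(table[0])  # subjects, raters
    if n < 2 or k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a complete table with >= 2 subjects and raters")
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    # Mean squares between subjects (BMS), between raters (JMS), residual (EMS).
    bms = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    jms = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ems = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    ) / ((n - 1) * (k - 1))
    icc_agreement = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    icc_consistency = (bms - ems) / (bms + (k - 1) * ems)
    return icc_agreement, icc_consistency


# Hypothetical data: the second rater is always one point higher.
table = [[1, 2], [2, 3], [1, 2], [3, 4]]
agreement, consistency = icc_single_rater(table)
print(round(agreement, 3), round(consistency, 3))  # 0.647 1.0
```

With a constant offset between the raters, the consistency form stays at 1, while the absolute-agreement form drops because the difference between rater means enters its denominator.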


Literature

  • J. Cohen: A coefficient of agreement for nominal scales. In: Educational and Psychological Measurement. 20, 1960, pp. 37-46, doi:10.1177/001316446002000104.
  • J. L. Fleiss: Measuring nominal scale agreement among many raters. In: Psychological Bulletin. 76 (5), 1971, pp. 378-382, doi:10.1037/h0031619.
  • K. O. McGraw, S. P. Wong: Forming inferences about some intraclass correlation coefficients. In: Psychological Methods. 1, 1996, pp. 30-46, doi:10.1037/1082-989X.1.1.30.
  • P. Shrout, J. L. Fleiss: Intraclass correlation: Uses in assessing rater reliability. In: Psychological Bulletin. 86, 1979, pp. 420-428, doi:10.1037/0033-2909.86.2.420.
  • M. Wirtz, F. Caspar: Assessment agreement and assessment reliability. Hogrefe, Göttingen [et al.] 2002, ISBN 3-8017-1646-5.

References

  1. Markus Wirtz: Assessment agreement and assessment reliability: Methods for determining and improving the reliability of assessments using category systems and rating scales. Hogrefe, Göttingen 2002, ISBN 3-8017-1646-5.
  2. K. Krippendorff: Reliability in Content Analysis: Some Common Misconceptions and Recommendations. In: Human Communication Research. 30 (3), 2004, pp. 411-433, doi:10.1111/j.1468-2958.2004.tb00738.x.