Classical test theory


Classical test theory (CTT) is the most widely used psychometric test theory. The focus of the model is the accuracy of a measurement, i.e. the size of the respective measurement error; it is therefore often referred to as measurement-error theory. Classical test theory seeks to clarify how conclusions about the true expression of the trait to be measured can be drawn from a person's test score.

Axioms

  1. Each test score X is composed of a true score component T and a random measurement error E: X = T + E.
  2. The expected value of the measurement error is zero: E(E) = 0.
  3. The measurement error is uncorrelated with the true score: Corr(T, E) = 0.
  4. The true score of one test is uncorrelated with the measurement error of a different test: Corr(T1, E2) = 0.
  5. The measurement errors of two different tests are uncorrelated with each other: Corr(E1, E2) = 0.

The greater the measurement error, the smaller the true-score component of the test value and the less reliable the test.

It also follows from the first two axioms that the expected value of the test score equals the true score:

E(X) = T.

This means that the measurement error averages out when a test is administered to many individuals, or when the same test is administered repeatedly to one and the same person.
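This averaging-out can be illustrated with a small simulation (the true score, the error distribution, and the sample size below are invented for illustration):

```python
import random

random.seed(42)

TRUE_SCORE = 100.0   # tau: the (hypothetical) true value of the trait
N = 100_000          # number of repeated measurements

# Axiom 1: each observed value is the true score plus a random error.
# Axiom 2: the error has expected value zero.
observations = [TRUE_SCORE + random.gauss(0, 15) for _ in range(N)]

mean_observed = sum(observations) / N
# With E(E) = 0, the mean of many observations converges to the true score.
print(round(mean_observed, 1))
```

With 100,000 simulated measurements, the mean observed score lands very close to the true score of 100, even though a single measurement can deviate substantially.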

Reliability

The central concept of classical test theory is reliability: the accuracy (freedom from measurement error) with which a test score captures the true score. Theoretically, reliability is defined as the ratio of the variance of the true scores to the variance of the test scores:

Rel = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E)),

with Var(T) as the variance of the error-free true scores and Var(E) as the variance of the measurement errors.
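Under the axioms above, this ratio can be recovered from simulated data. The distributions and sample size below are invented; with a true-score standard deviation of 15 and an error standard deviation of 5, the theoretical reliability is 225 / 250 = 0.90:

```python
import random

random.seed(0)
n_persons = 10_000

# Invented population: true scores with SD 15, errors with SD 5.
true_scores = [random.gauss(100, 15) for _ in range(n_persons)]
errors = [random.gauss(0, 5) for _ in range(n_persons)]
observed = [t + e for t, e in zip(true_scores, errors)]   # X = T + E

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Rel = Var(T) / Var(X); theoretical value 225 / (225 + 25) = 0.90.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```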

From this representation, an initially paradoxical conclusion becomes clear: an increase in the variance of systematic errors (distortions) leads to an increase in reliability, because systematic distortions are added not to the error variance Var(E) but to the true-score variance Var(T).

Procedures for estimating reliability

Since the true scores are unknown, reliability can only be estimated. One method is the so-called split-half reliability, in which the test is split at the item level into two halves of equal length, whose scores are then correlated with each other. In its basic form this procedure is essentially only of historical importance.
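A minimal sketch of the procedure, using an invented six-item data set: the items are split into odd and even halves, the half scores are correlated, and the Spearman-Brown formula steps the half-test correlation up to the full test length:

```python
# Invented 6-item data set: rows are persons, columns are item scores.
scores = [
    [3, 4, 3, 5, 4, 4],
    [1, 2, 2, 1, 2, 1],
    [4, 5, 5, 4, 5, 5],
    [2, 2, 3, 2, 3, 2],
    [5, 4, 5, 5, 4, 5],
]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Split the test at item level: odd-numbered vs. even-numbered items.
half_a = [sum(row[0::2]) for row in scores]
half_b = [sum(row[1::2]) for row in scores]

r_halves = pearson(half_a, half_b)
# Spearman-Brown step-up: reliability estimate for the full-length test.
split_half_reliability = 2 * r_halves / (1 + r_halves)
print(round(split_half_reliability, 2))
```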

A method that can be described as a generalization of split-half reliability is far more common nowadays: each item is treated as a separate part of the test and correlated with the other items of the subscale. Cronbach's alpha is often used for this purpose and also serves as a measure of internal consistency. The alpha coefficient is a lower bound of the reliability. Cronbach's alpha assumes that the items are homogeneous (essentially tau-equivalent) without testing this assumption; therefore, the congeneric reliability, which does not require this homogeneity, is increasingly determined instead of this coefficient.
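For illustration, Cronbach's alpha can be computed directly from its defining formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the data set below is invented:

```python
# Invented 4-item data set: rows are persons, columns are item scores.
scores = [
    [3, 4, 3, 5],
    [1, 2, 2, 1],
    [4, 5, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
]

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

k = len(scores[0])                    # number of items
item_columns = list(zip(*scores))     # one tuple of scores per item
item_variances = [sample_variance(col) for col in item_columns]
total_variance = sample_variance([sum(row) for row in scores])

# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))
```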

Another important estimation method is test-retest reliability: the correlation of scores on the same test administered at two different points in time. A retest reliability is meaningless if the interval between the two administrations is not reported. Using retest reliability for unstable constructs is inappropriate (for example, the retest reliability of a test that measures hunger would reflect not the reliability of the test but the volatility of the feeling of hunger); this leads to an underestimation of the reliability. Very short intervals between the administrations are also problematic, since memory effects can lead to an overestimation of the reliability.

Another method is the construction of parallel tests: tests that are assumed to measure the same true scores. The reliability can then be estimated by correlating two parallel forms X1 and X2; this is called parallel-test reliability. Its advantage is that it requires neither item homogeneity (as Cronbach's alpha does) nor temporal stability (as retest reliability does), which is why it could in theory be called the ideal solution. In practice, however, it is extremely difficult to construct parallel forms, since corresponding items must not differ in mean item difficulty, in discriminatory power, or even in their correlations with other scales. This contributes to the fact that this form of reliability estimation is used very rarely. Certain performance tests such as IQ tests, however, need parallel forms anyway because of the risk of copying; there, the parallel-test reliability can be reported as a beneficial side effect.

Interrater reliability is also worth mentioning. It is used in particular with interview and observation methods to estimate reliability. Cohen's kappa is available for nominally scaled data, the intraclass correlation for metrically scaled data, and Spearman's rank correlation coefficient (Spearman's rho) for ordinally scaled data.
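As an illustration, Cohen's kappa corrects the observed agreement of two raters for the agreement expected by chance; the ratings below are invented:

```python
from collections import Counter

# Invented nominal ratings of ten cases by two raters.
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

n = len(rater_1)
observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Chance agreement from the raters' marginal category frequencies.
freq_1, freq_2 = Counter(rater_1), Counter(rater_2)
chance = sum((freq_1[c] / n) * (freq_2[c] / n) for c in set(freq_1) | set(freq_2))

# kappa = (observed - chance) / (1 - chance)
kappa = (observed - chance) / (1 - chance)
print(round(kappa, 2))
```

Here the raters agree on 8 of 10 cases (0.80), but with chance agreement of 0.52 the kappa drops to about 0.58.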

Objectivity

Objectivity plays a subordinate role in classical test theory. CTT is a theory whose axioms relate mainly to measurement errors; it is thus a theory of measurement error, and thereby indirectly a theory of reliability, which is defined as freedom from (unsystematic) measurement error. Objectivity can in this context be regarded as a sub-aspect of reliability: it refers to the extent to which the variance of the test score cannot be traced back to variance caused by the investigator and the test conditions (e.g. experimenter effects). Objectivity thus, like reliability, excludes measurement errors caused by the investigator and the conditions, and can be divided into several aspects:

  • Administration objectivity - test results do not vary because of different examination conditions on different measurement occasions
  • Scoring objectivity - the test scores obtained do not vary between different scorers
  • Interpretation objectivity - the conclusions drawn from the test result do not vary between different interpreters

The relationship to reliability is particularly evident in the last two points. In theory, both aspects can be quantified via interrater agreement. In practice, however, one predominantly ensures conditions that are believed to bring about objectivity: a test that is as standardized as possible, with fixed interpretation aids in the manual, is regarded as guaranteeing scoring and interpretation objectivity, while standardized examination conditions are intended to ensure administration objectivity. Usually a distinction is then made only between given and not given.

Validity

Analogous to reliability, validity in classical test theory can be understood as the proportion of the variance that is due solely to the construct to be measured, and neither to unsystematic random errors nor to systematic distortions:

Val = Var(C) / (Var(C) + Var(B) + Var(E)),

with Var(C) as the variance attributable exclusively to the construct under investigation, Var(B) as the variance of the systematic distortions (bias), and Var(E) as the variance of the measurement errors.

In contrast to reliability, an increase in the systematic error variance here leads to a decrease, which is intuitively plausible.
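The contrast can be illustrated with invented variance components: a systematic bias inflates the true-score variance and hence the reliability, while it shrinks the share of construct variance and hence the validity.

```python
# Invented variance components of a test score (X = C + B + E):
var_construct = 80.0   # variance due to the construct itself (C)
var_bias = 15.0        # variance of systematic distortions (B)
var_error = 5.0        # variance of random measurement error (E)

total_var = var_construct + var_bias + var_error

# Systematic distortions count toward the "true" score,
# so they raise the reliability ...
reliability = (var_construct + var_bias) / total_var   # 95 / 100 = 0.95
# ... but they do not belong to the construct, so they lower the validity.
validity = var_construct / total_var                   # 80 / 100 = 0.80
print(reliability, validity)
```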

Procedures for estimating validity

The validity of a test is much harder to estimate than its reliability. On the one hand, validity, unlike reliability, is a rather heterogeneous concept that is estimated in practice via many different kinds of indicators. On the other hand, some aspects of validity cannot be quantified, or quantifying them is not common in test-construction practice. Three main forms of (psychometric) validity are relevant for test construction:

  • Content validity : concerns, among other things, the question of whether the items are really suitable for capturing a certain construct. In practice it is usually judged as given or not given on the basis of expert judgments, although in principle it could be quantified, e.g. via interrater agreement measures applied to those expert judgments.
  • Construct validity : related to content validity, but concerned with intersubjectively (empirically, quantitatively) verifiable evidence that the relevant construct, and nothing else, is actually being measured. This is examined in several ways:
    1. Internal structure / factorial validity - testable with exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and structural equation models (SEM)
    2. Convergent and discriminant validity - correlations with related tests measuring the same construct and with dissimilar tests measuring different constructs, determined e.g. via bivariate correlations; the multitrait-multimethod (MTMM) approach is applicable, with confirmatory testing e.g. via CFA
  • Criterion validity : in practice one of the most important quality criteria. It indicates how well, for example, the results of other tests or behavior can be predicted from the test result, and corresponds to the correlation with an external criterion (e.g. the correlation between intelligence and professional success). Based on the temporal relation between test result and criterion, one can distinguish:
    1. Retrospective validity - how strongly a current measurement correlates with past measurements caused by the same construct
    2. Concurrent validity - how strongly a current measurement correlates with other current measurements caused by the same construct
    3. Predictive validity - how strongly a measurement correlates with later measurements caused by the same construct

Advantages

  • The assumptions of classical test theory are simple and mathematically quite undemanding, in contrast to those of probabilistic test theory.
  • CTT has already been implemented in many tests and has therefore proven itself in practice.

Criticism

  • The error assumption may be too coarse, since different types of error would have to be distinguished. The extended latent state-trait model (Steyer and others) offers a more differentiated approach.
  • The sample dependence of reliability, item difficulty, and item discrimination is not, or only insufficiently, taken into account in CTT.
  • The homogeneity of items cannot be tested within CTT.
  • According to the attenuation paradox, the criterion-related validity of a test can decrease with increasing reliability of the criterion and of the validated test.
  • Classical test theory can only capture stable personality traits. If the true score changed over time, this would contradict the second axiom, according to which the expected value of the errors is zero.
  • Data at interval-scale level are assumed, since means and variances are calculated.

Alternative psychometric models

The evaluation of psychometric data can also be carried out using latent-trait models (e.g. the Rasch model). These can solve some of the problems associated with CTT, but they also create new ones (see also probabilistic test theory).

Literature

  • Gustav A. Lienert, Ulrich Raatz: Test setup and test analysis. 6th edition. Beltz-Verlags-Union, Weinheim 1998, ISBN 3-621-27424-3 .
  • Helfried Moosbrugger, Augustin Kelava (Eds.): Test theory and questionnaire construction. 2nd, updated edition. Springer-Medizin-Verlag, Heidelberg 2012, ISBN 978-3-642-20071-7.
  • Frederic M. Lord, Melvin R. Novick: Statistical theories of mental test scores. Addison-Wesley, Reading MA et al. 1968, ISBN 0-201-04310-6 .

References

  1. Schmitz-Atzert, Amelang: Psychological diagnostics. 5th, completely revised and expanded edition. Springer, Berlin / Heidelberg 2012, ISBN 978-3-642-17000-3, pp. 40 ff.
  2. Hermann-Josef Fisseni: Textbook of psychological diagnostics. 3rd, revised and expanded edition. Hogrefe, Göttingen 2004, ISBN 3-8017-1756-9, p. 81.
  3. Hermann-Josef Fisseni: Textbook of psychological diagnostics. 3rd, revised and expanded edition. Hogrefe, Göttingen 2004, ISBN 3-8017-1756-9, section 4.3.3.4.
  4. Hermann-Josef Fisseni: Textbook of psychological diagnostics. 3rd, revised and expanded edition. Hogrefe, Göttingen 2004, ISBN 3-8017-1756-9, p. 50.