Item analysis

[Figure: one-dimensional congeneric measurement model with items]

An item analysis uses a set of statistical procedures to examine the suitability of individual items (whose values were generated, for example, by the questions of a written survey) with regard to the objective of the survey.

The aim is to test the quality of a scale (scale here means an instrument for measuring certain variables, e.g. the basic political attitudes or the environmental awareness of the test subjects) by checking its items and, if necessary, improving it. The task of item analysis is therefore to check the usefulness of individual items for a specific test.

Item analysis is a central instrument of test construction and test evaluation and, by definition, comes closest to the essence of reliability (as a test quality criterion). Decisive for the test as a whole, however, are its quality criteria and in particular the question of validity, i.e. what the test (all items taken together) is actually supposed to measure.

Definition

A precise definition of the term item analysis is not given in the literature. It serves to determine empirical psychometric criteria for individual test items. In classical test construction, item analysis usually includes:

  • the calculation of statistical parameters:
    • item difficulty
    • selectivity
    • homogeneity
  • the check of dimensionality.

The analysis is performed on a sample that is intended to represent the population for which the test was designed. The data of the item analysis are used for the selection and revision of items, for their ranking within the test and possibly for the design of a parallel test.

Analysis of the raw score distribution

The test values can be displayed graphically (e.g. as a histogram), which gives a first overview of the distribution of the values. The main interest here is the dispersion and the question of whether the raw-score distribution corresponds to a normal distribution. Since many inferential statistical methods require a normal distribution, such a distribution is desirable.
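As a minimal sketch of this first step, the raw scores can be summarized and plotted as a crude text histogram; the scores below are invented example data (in practice one would also apply a formal normality test):

```python
# Sketch: a first look at the raw-score distribution before the item
# analysis proper; the raw scores below are invented example data.
from statistics import mean, stdev
from collections import Counter

raw_scores = [12, 14, 15, 15, 16, 16, 16, 17, 17, 18, 19, 21]

print(f"mean = {mean(raw_scores):.2f}, sd = {stdev(raw_scores):.2f}")

# crude text histogram: one '#' per subject at each raw score
for score, n in sorted(Counter(raw_scores).items()):
    print(f"{score:3d} | {'#' * n}")
```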

Statistical parameters

Item difficulty

The item difficulty is expressed by an index that corresponds to the proportion of those people who solve the item correctly or answer it in the affirmative (Bortz & Döring, 2005). For this reason the index used to be called the popularity index.

The purpose of the difficulty index is to distinguish between test persons with a high and a low level of the measured characteristic. The ability of an item to make this distinction is called its selectivity. In classical test construction, items of medium difficulty usually have the best selectivity. In the extreme, items that were answered in the affirmative by all test persons, or that could not be solved by any test person, are unusable; the difficulty index serves to single out such items. An item difficulty of 50% is considered optimal, and items below 20% and above 80% are eliminated as a rule. However, if one were to choose only items with an item difficulty of 50%, there would be poor differentiation among test persons with low characteristic values and none among those with high characteristic values (ceiling effect). This means, for example, that people with above-average intelligence could solve all the tasks of an intelligence test if it did not contain tasks so difficult that only the most gifted can solve them; differences within the group of gifted people could then no longer be determined.

In level tests, the difficulty indices should therefore spread over the entire range of the measured characteristic as far as possible in order to obtain the largest possible range of application for the test. However, if the item difficulties differ greatly, the internal consistency of the scale suffers, i.e. the answer to an easy item does not predict whether a difficult item will be answered. The construction of level tests with classical test theory is therefore difficult.

Difficulty calculation for two-stage answers (e.g. true / not true):

p = (N_r / N) · 100

N_r = number of "correct solvers", N = number of test subjects, p = difficulty index (only for two-stage answers!)

This represents a solution for the simplest case. If tasks were not worked on by all test persons, or if it is suspected that answers were partly only guessed correctly, corrected formulas must be used instead (cf. Fisseni, 1997, pp. 41–42).
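For the simplest case above, the index is just the proportion of correct answers; the following sketch uses invented item data and expresses p as a proportion rather than a percentage:

```python
# Minimal sketch of the difficulty index for two-stage answers:
# p = (number of correct solvers) / (number of test subjects).
# The item data are invented for illustration.

def difficulty_index(answers):
    """answers: one 0/1 entry per test subject (1 = solved / affirmed)."""
    return sum(answers) / len(answers)

answers_item_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 7 of 10 subjects solve it
p = difficulty_index(answers_item_a)
print(f"p = {p:.2f}")  # 0.70: within the usual 0.20-0.80 acceptance range
```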

Difficulty calculation for multi-level answers:

In this case p is not defined. Possible solutions to the problem:

  • Dichotomize the item scores (e.g. into 0 and 1), then calculate p as in the two-stage case.
  • Calculate the mean value and dispersion (the mean is the equivalent of p, but the dispersion must additionally be taken into account).
  • Use an index defined for multi-level answers; in a simplified form:

p_m = Σ_v x_vi / (N · x_max)

where x_vi is the score of subject v on item i, N is the number of test subjects, and x_max is the maximum attainable item score.
Various authors have proposed refinements for a more precise calculation (cf. Fisseni, 2004, pp. 43–45).
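The simplified multi-level index, read as the share of the maximum attainable points actually achieved, can be sketched as follows; the scores and scale are invented for illustration:

```python
# Sketch of one common, simplified difficulty index for multi-level answers:
# the achieved points divided by the maximum attainable points on the item.
# The scores below are invented example data.

def difficulty_multilevel(scores, max_score):
    """scores: one item score per test subject, each between 0 and max_score."""
    return sum(scores) / (len(scores) * max_score)

# five subjects answer a 0-5 point item
p_m = difficulty_multilevel([3, 4, 2, 5, 1], max_score=5)
print(f"p_m = {p_m:.2f}")  # 15 of 25 attainable points -> 0.60
```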

Differences in difficulty between two items can be checked using a multi-field (contingency) table.

Strictly speaking, these formulas only apply to pure level tests, i.e. tests that impose no time limit and/or in which all test subjects were able to work on all tasks. If the latter condition is not met, as is often the case with performance tests, the number of "correct" answers must be related not to the total number of test subjects but only to the number who actually worked on the respective task (cf. Lienert, 1989).

Selectivity

The selectivity of an item shows how well the overall test result can be predicted based on the answer to an individual item (Bortz & Döring, 2005). A high degree of discrimination means that the item is able to differentiate between the test subjects in the sense of the overall test (i.e. subjects with a high level of characteristics solve an item “correctly”, subjects with a lower level do not).

The selectivity is represented by the selectivity coefficient. This correlation between an individual item and the overall test score as a criterion is calculated for each individual item; which correlation coefficient is chosen depends on the scale level of the test values. If the test score is interval-scaled and normally distributed, the product-moment correlation between the values x_i per item i and the corrected total value t (the total score without item i) is chosen as the selectivity r_it:

r_it = cov(x_i, t) / (s_i · s_t)
If r_it = 0, an item is solved equally often by subjects with high and low characteristic values. Unless negative selectivities are justified by a reversed item (or scale) formulation, such items are considered unusable.

In principle, the highest possible absolute selectivity is desirable, especially for level tests. The selectivity of each item depends on its difficulty, on the homogeneity or dimensionality of the test, on the position of the item within the test, and on the reliability of the criterion. (Instead of the test score, an external criterion can also be used; the coefficient is then also a validity coefficient.) The highest selectivities are found for items of medium difficulty (cf. Lienert, 1989).
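The corrected item-total correlation described above can be sketched with a plain product-moment correlation; the answer matrix below is invented example data:

```python
# Sketch of the corrected item-total correlation (selectivity) for
# dichotomous items; the answer matrix below is invented example data.
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def selectivity(data, i):
    """Correlate item i with the total score *excluding* item i
    (the corrected total t), so the item is not correlated with itself."""
    item = [row[i] for row in data]
    rest = [sum(row) - row[i] for row in data]
    return pearson(item, rest)

# rows = test subjects, columns = items (1 = solved, 0 = not solved)
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"r_it for item 0: {selectivity(data, 0):.2f}")
```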

Homogeneity

The homogeneity indicates how highly the individual items in a test correlate with one another on average. In the case of high homogeneity, the items of a test record similar information (Bortz & Döring, 2005).

If all k test items are correlated with one another in pairs, the result is k(k − 1)/2 correlation coefficients r_ij, whose mean value (calculated via Fisher's Z transformation) describes the homogeneity of the test.

The level of the item intercorrelations depends on the difficulty. The greater the differences in difficulty between the items, the lower the intercorrelation, which in turn influences the reliability of a test. As a rule, therefore, either uncorrelated (i.e. heterogeneous) items of the same difficulty or positively correlated (i.e. homogeneous) items of different difficulty are used for a (sub) test (cf. Lienert, 1989).

Dimensionality

The dimensionality of a test indicates whether it only records one feature or construct (one-dimensional test) or whether several constructs or partial constructs are operationalized with the test items (multi-dimensional test) (Bortz & Döring 2005).
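One rough, informal check of unidimensionality (a simplification of the factor-analytic approach usually used) is whether the largest eigenvalue of the inter-item correlation matrix accounts for most of the variance; the matrix below is invented example data:

```python
# Sketch of a rough unidimensionality check: if the largest eigenvalue of
# the inter-item correlation matrix dominates, a single dimension prevails.
# Power iteration with a Rayleigh quotient, stdlib only; R is invented data.

def dominant_eigenvalue(matrix, iterations=200):
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv)) / sum(a * a for a in v)

# hypothetical inter-item correlation matrix of a 3-item test
R = [
    [1.00, 0.60, 0.50],
    [0.60, 1.00, 0.55],
    [0.50, 0.55, 1.00],
]
lam = dominant_eigenvalue(R)
# the trace of a k x k correlation matrix is k, so lam / k is the share
print(f"first eigenvalue explains {lam / len(R):.0%} of the variance")
```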

Literature

  • Bortz, J. & Döring, N. (2005): Research Methods and Evaluation. Heidelberg: Springer-Verlag. ISBN 3-540-41940-3
  • Fisseni, H.-J. (1997): Textbook of Psychological Diagnostics. Göttingen: Hogrefe. ISBN 3-8017-0982-5
  • Lienert, G. A. (1989): Test Setup and Test Analysis (4th edition). Munich: PVU. ISBN 3-621-27086-8
