Methodology of the PISA studies

The methodology of the PISA studies is the procedure on which the PISA studies are based. PISA is carried out on behalf of the OECD, with the participation of various advisory committees, by a consortium of companies from the testing industry. National project centers are involved in each participating state. Approximately 5,000 students are tested per state.

The test consists of a two-hour "cognitive" test session, followed by a questionnaire session lasting just under an hour. In the cognitive test, not all students work on the same tasks: in 2003, thirteen different test booklets were used (plus a short booklet for special schools in some countries); out of a total of 165 different tasks, each individual student dealt with only about 50. The student solutions are coded by trained assistants, recorded digitally and transmitted to the international project center in Australia for further analysis. Most tasks are ultimately rated only as either "wrong" or "right". Depending on how many students have solved a task correctly, the task is assigned a certain "difficulty value"; depending on how many tasks a student has solved, the student is assigned a certain range of "plausible" "competence values". The difficulty and competence scales are subsequently scaled so that, averaged across the OECD countries, the competence values have a mean of 500 and a standard deviation of 100. To compensate for the fact that the test booklets were of different difficulty and that individual tasks could not be evaluated in individual countries (for example because of printing errors), the entire "scaling" of difficulty and competence values is calculated with the aid of a complex mathematical model of student response behavior, so-called item response theory.
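
Purely to illustrate the final reporting step, the sketch below linearly rescales a handful of hypothetical provisional competence estimates so that they end up with a mean of 500 and a standard deviation of 100. The actual PISA scaling is far more involved (item-response modelling, country weighting), so this is only the cosmetic last step under assumed inputs.

```python
import statistics

def to_pisa_scale(raw_scores, target_mean=500.0, target_sd=100.0):
    """Linearly rescale provisional competence estimates (e.g. logits)
    so that they have the PISA reporting mean and standard deviation.
    This is only the final, cosmetic step of the real scaling."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    return [target_mean + target_sd * (x - mean) / sd for x in raw_scores]

# Hypothetical provisional estimates (in logits) for a handful of students.
raw = [-1.2, -0.3, 0.0, 0.4, 1.1]
print(to_pisa_scale(raw))
```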

The task difficulty values allow a didactic interpretation of the test results: if a student has achieved 530 competence points, for example, then there is a 62% probability that he or she can solve a task with a difficulty of 530 (the figure of 62% was fixed arbitrarily). Looking at published sample tasks with difficulty values in the vicinity of 530 therefore gives an impression of what a competence value of 530 means. One has to bear in mind, however, that the test takes place under considerable time pressure (just over 2 minutes per task). Almost all further evaluations are based on examining the statistical distribution of student competence values in the participating states or in more finely disaggregated populations.
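
The probability statement can be illustrated with a logistic (Rasch-type) response curve shifted so that a student whose competence equals the task difficulty succeeds with probability 62%, as stated above. The conversion factor between scale points and logits is not given in the article, so POINTS_PER_LOGIT below is an assumption chosen only for illustration.

```python
import math

POINTS_PER_LOGIT = 100.0        # assumed conversion; not specified in the article
OFFSET = math.log(0.62 / 0.38)  # shift so that competence == difficulty gives 62%

def solution_probability(competence, difficulty):
    """Probability that a student with the given competence value solves
    a task with the given difficulty value (one-parameter logistic model)."""
    z = (competence - difficulty) / POINTS_PER_LOGIT + OFFSET
    return 1.0 / (1.0 + math.exp(-z))

# A student with 530 points facing tasks of difficulty 430, 530 and 630:
for d in (430, 530, 630):
    print(d, round(solution_probability(530, d), 2))
# roughly 0.82, 0.62 and 0.38 under the assumed conversion factor
```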

Preparation, implementation and evaluation are described in hundreds of pages of technical reports and evaluation manuals.

Project management

PISA is one of several projects through which the OECD has become increasingly involved in educational monitoring since the 1990s. The coordination and final editing of the international reports are the responsibility of a small working group at the OECD headquarters in Paris, headed by the German Andreas Schleicher. Politically, the project is steered by a council of government representatives; scientifically, it is accompanied by a committee of experts and sub-committees; these experts, subject didactics specialists and educational researchers, are involved in particular in the selection of test items. The creation and evaluation of the test tasks was put out to tender and awarded to a consortium of several companies from the testing industry. The lead was given to the Australian Council for Educational Research (ACER), the institute at which Schleicher, originally a physicist, trained as an educational statistician.

In the individual participating states, test implementation, evaluation and publication of results are carried out by a national project partner. In small states these are small working groups, often with fewer than five members; in Germany, several dozen people are involved with PISA, not least because the supplementary study PISA-E is much more expensive than the German contribution to PISA in the narrower sense (the "I" standing for international).

  • In Germany, PISA 2000 was coordinated by the Max Planck Institute for Human Development (MPIB) in Berlin under the direction of Jürgen Baumert. For PISA 2003 and 2006, project management lay with the Leibniz Institute for Science Education (IPN) in Kiel under the direction of Manfred Prenzel. From 2009, PISA was carried out by the German Institute for International Educational Research (DIPF) in Frankfurt am Main under the direction of Eckhard Klieme. From 2012, PISA was taken over by the newly established Center for International Comparative Educational Studies (ZIB) under the direction of Manfred Prenzel and later Kristina Reiss.
  • For Liechtenstein and Switzerland, PISA is coordinated by the Federal Statistical Office in Neuchâtel. There one can also read that the national implementation of a three-year study results in project costs of about 3 million Swiss francs (personnel costs, fees, travel costs, contributions to international coordination, but not including the salary shares of permanent employees who spend part of their working time on PISA).
  • In Austria, PISA is coordinated by the Project Center for Comparative Educational Research (ZVB) in Salzburg under the direction of Günter Haider.
  • South Tyrol takes over the test booklets from Salzburg and has them coded there after the test, before the data are forwarded to the Italian project center in Frascati. In the international report, the results of South Tyrol, like those of some other Italian regions (hardly coincidentally, without exception economically well-off northern regions), are shown separately, although the sample size actually required for this was not achieved.
  • In Luxembourg, from 2003 students could choose between a German-language and a French-language test booklet; the vast majority chose to be tested in German. The project management consists of a tiny working group in the Ministry of Education.

Preparation

The process of evaluating the test items was accompanied and influenced by each participating country and ranged from development by the international PISA development team, translation into the languages of the participating countries, the evaluation of each individual item by curriculum experts and pre-tests in each participating country, through to the Rasch scaling. The complete evaluation process is documented in the technical report. The school and student samples were chosen so that, according to the current state of research, they are as representative as possible of the respective national population.

Test execution

43 countries took part in PISA 2000; however, the official publications only report data for 32 of them. Around 180,000 students were tested in these states: between 4,500 and 10,000 per state. In Liechtenstein, Luxembourg and Iceland the sample comprised the entire 15-year-old population.

The students do not all work on the same tasks. To improve the quality of the data (and at the price of additional scaling effort), a study comprises nine task booklets (test booklets), of which each student works on only four (rotated test design). After the four thirty-minute task blocks, each student fills out an extensive questionnaire, in particular about his or her socio-economic background. The additional study on self-regulated learning was carried out in 2000 using questionnaires; the 2003 problem-solving investigation also included test items.
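
What such a rotated design can look like is sketched below under plainly stated assumptions: nine half-hour task blocks are distributed cyclically over nine booklets so that each booklet contains four blocks and every block appears equally often. The concrete block-to-booklet assignment used by PISA is not given in the article; the cyclic layout here is purely illustrative.

```python
NUM_BLOCKS = 9           # nine half-hour task blocks (clusters)
BLOCKS_PER_BOOKLET = 4   # each student works through four of them (two hours)

def rotated_booklets(num_blocks=NUM_BLOCKS, per_booklet=BLOCKS_PER_BOOKLET):
    """Simple cyclic rotation: booklet k contains blocks k, k+1, k+2, k+3
    (mod num_blocks), so every block appears in exactly four booklets."""
    return [
        [(k + offset) % num_blocks + 1 for offset in range(per_booklet)]
        for k in range(num_blocks)
    ]

for number, blocks in enumerate(rotated_booklets(), start=1):
    print(f"Booklet {number}: blocks {blocks}")
```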

Data acquisition and processing

All student answers are converted into code letters or numbers by specially trained staff and entered into a computer. All data sets are passed to a subcontractor (the Australian institute ACER) for scaling. The difficulty of the individual subtasks ("items") is first determined from the students' answers (and only those from the OECD member states). The scaled data are then returned to the national project groups, which evaluate the data in detail. The OECD and the national project groups publish their first results in the year following the test.

After publication of the first results, the data sets (with the exception of a few keys, in Germany for example the federal state and the school type) are also made available to external researchers: original student responses and scaled student data can be downloaded from ACER but, as the accompanying manual makes clear, are usable only by specialists. An independent didactic interpretation is not possible, because the student answers in the published data set are coded only as <correct | wrong | not processed> and the tasks themselves are not available.

The published task solutions suggest that, when recording student responses to multiple-choice questions (in contrast to tasks with a different answer format), no distinction is made between "wrong" and "not processed". The coding manual indicates, however, that this impression is mistaken and that the international raw data set does record whether an answer was given and, if so, which one. In the absence of clear statements, however, one must assume that the official data preparation (see the next section) did not differentiate between incorrect (i.e. possibly guessed) and omitted answers, in contrast to other standardized tests (e.g. the SAT), where incorrect multiple-choice responses are penalized with a deduction of points.
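
To make the contrast with the SAT concrete, the sketch below compares simple number-right scoring, in which wrong and omitted answers both contribute nothing, with a formula score that deducts a fraction of a point for wrong multiple-choice answers. The quarter-point penalty of the kind historically used by the SAT for five-option items is shown here only for illustration; it is not part of PISA.

```python
# Each response is coded as "correct", "wrong" or "not processed".
responses = ["correct", "wrong", "not processed", "correct", "wrong"]

def number_right(responses):
    """Scoring that ignores the wrong/omitted distinction entirely."""
    return sum(1 for r in responses if r == "correct")

def formula_score(responses, penalty=0.25):
    """SAT-style scoring: wrong answers cost a fraction of a point,
    omitted answers cost nothing, so blind guessing is discouraged."""
    return sum(
        1 if r == "correct" else -penalty if r == "wrong" else 0
        for r in responses
    )

print(number_right(responses))   # 2
print(formula_score(responses))  # 1.5
```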

Tasks

With the help of copyright law, the PISA consortium manages to keep the tasks used worldwide secret. The secrecy is necessary in order to be able to reuse individual tasks in follow-up studies, which in turn is necessary in order to link the difficulty scales of different cycles to one another.

Only a few sample tasks have been published, and the same ones in all languages. Some of the released tasks stem from preliminary studies and were not used in the main round because of certain deficiencies; one task ("Antarctica") turned out to be unsatisfactory only in the 2000 main round.

Evaluation

The evaluation of the PISA study is based on mathematical models that make it possible to describe task difficulties and student competences on the same performance scale (Rasch model; see also Rost, J. (2004). Test theory. Bern: Huber.). This scale was arbitrarily chosen so that the student competences of the entire OECD sample (excluding partner countries) have a mean of 500 and a standard deviation of 100. This means that students with competence values of 400, 500 and 600 perform better than 15.9%, 50% and 84.1% of all OECD students, respectively. Because Turkey, with its low values, was included in the calculation of the OECD mean for the first time in 2003, the values of all other countries improved by 3 points compared with 2000, without any substantive improvement in those countries. If the countries were weighted in the averaging according to the number of students in the tested age cohort, further such "improvements" could be achieved.

A similar scale construction is known from IQ tests, whose mean is 100 and whose standard deviation is usually 15; the conversion factor for deviations from the PISA mean of 500 is therefore 100 / 15 ≈ 6.67. In the opinion of the educational researchers involved, however, the tasks of the PISA tests have nothing to do with IQ tests, and they are therefore reluctant to convert the results into IQ values (see criticism).
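
Purely as arithmetic backing for the two preceding paragraphs (and not as an endorsement of IQ conversions), the percentile figures follow from the standard normal distribution and the conversion factor from the ratio of the two standard deviations:

```latex
% Percentiles of a normal distribution with mean 500 and standard deviation 100:
P(X < 400) = \Phi\!\left(\tfrac{400-500}{100}\right) = \Phi(-1) \approx 15.9\,\%, \qquad
P(X < 500) = \Phi(0) = 50\,\%, \qquad
P(X < 600) = \Phi(1) \approx 84.1\,\%.

% Mapping deviations from the PISA mean onto an IQ-type scale (mean 100, SD 15):
\mathrm{IQ}_{\text{equiv}} = 100 + \frac{\text{PISA} - 500}{100/15},
\qquad \frac{100}{15} \approx 6.67 .
```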

The performance scale of the PISA studies is constructed in such a way that the student abilities are normally distributed with a mean of 500 and a standard deviation of 100. Such a normalization (with a mean of 100 and a standard deviation of usually 15) has long been common in IQ tests.

In fact, PISA does not use one performance scale but three, for the three competence fields of reading, mathematics and science. In addition, sub-scales are formed for the competence field that is examined in depth in a given cycle. In PISA 2000, reading competence was subdivided into "retrieving information", "interpreting texts" and "reflecting and evaluating"; in PISA 2003 there were four subscales for the mathematics focus: "Space and Shape", "Change and Relationships", "Quantity" and "Uncertainty".

However, all competences and sub-competences are highly correlated, and they can easily be averaged. A summary rating on a single scale is not found in any of the official publications; however, it was produced by some press organs in order to present PISA even more strikingly as a quasi-Olympic comparison of countries.

It is postulated that task difficulty and student competence determine the probability of a solution. A task i, for example, has the difficulty ξ_i = 550 if a student ν with the ability σ_ν = 550 can solve this task with "sufficient confidence". It is arbitrarily defined that "sufficient confidence" means a solution probability of 62%.
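
Under a one-parameter logistic (Rasch-type) model of the kind named above, this convention can be written as a fixed offset in the response function. The exact parametrization used by PISA is not reproduced in the article, so the following is only a sketch; a stands for the unspecified conversion between reporting-scale points and logits:

```latex
P\bigl(\text{task } i \text{ solved} \mid \sigma_\nu , \xi_i \bigr)
  \;=\; \frac{1}{1 + \exp\!\bigl(-\,[\,a\,(\sigma_\nu - \xi_i) + c\,]\bigr)},
\qquad
c = \ln\frac{0.62}{0.38} \approx 0.49,
```

so that σ_ν = ξ_i indeed yields a solution probability of exactly 62%.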

As part of the evaluation, both the task difficulties and the student competences must be determined from the student data sets. This evaluation depends on model assumptions (item response theory), is extremely complicated and is poorly documented. The official description in the technical report (p. 99 ff.) is kept very general: no concrete numerical values are given for the model parameters, and it is not even possible to work out the dimensions of important vectors. The following can be read out reasonably reliably:

A subsample of 500 students is drawn from each of 27 OECD countries. It is assumed that the latent abilities (for PISA 2000: mathematics, science and the three reading sub-competences, i.e. five dimensions) are multivariate normally distributed among the 13,500 students in this subsample. Under this assumption, the coefficients of an item response model can be calculated which describes how difficult it is for a test subject with a certain ability profile to solve a certain subtask.
The ability profile of student ν is a vector σ_ν whose five components are the sub-competences in mathematics, science and reading (the latter counted three times). The task difficulty ξ_i is described in this part of the technical report as a vector (with unknown dimension p), but everywhere else as a scalar.
We now know the probability with which a certain ability vector produces a certain response behavior. The task, however, is the reverse: to infer abilities from the actual response pattern. There is no unique way of doing this. In the scaled student data sets, two routes are taken to provide approximate indications of student abilities: (1) the most probable ability values (maximum likelihood estimates) are given; however, these values are not suitable for characterizing larger populations. (2) So-called plausible values are given: for each of the 180,000 test subjects, five exemplary ability vectors are drawn with the help of random numbers, the drawing being controlled in such a way that the measured response patterns are reproduced when averaging over a sufficiently large population. It makes sense to carry out all further analyses on this data set five times, with one instance of the ability vector per student; by comparing the five final numerical results, one can then assess the uncertainty introduced by the use of random numbers.

To characterize certain subpopulations, for example by country, by gender or by socio-economic criteria, one simply forms mean values from the "plausible value" ability values of the individual students.
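
A minimal sketch of this procedure with a toy data set: for each country the mean is computed once per plausible value, the five per-value means are averaged, and their spread gives a rough impression of the uncertainty introduced by the random draws, as described above. The numbers below are invented, and real PISA analyses additionally use sampling weights, which are omitted here.

```python
import statistics

# Toy records: (country, [five plausible values for one scale]).
students = [
    ("A", [512, 498, 505, 520, 509]),
    ("A", [455, 470, 462, 449, 468]),
    ("B", [601, 588, 595, 610, 592]),
    ("B", [530, 541, 525, 538, 547]),
]

def country_mean(records, country):
    """Average the country mean over the five plausible values and report
    the spread of the five per-value means as a rough uncertainty indicator."""
    rows = [pv for c, pv in records if c == country]
    per_pv_means = [statistics.mean(r[i] for r in rows) for i in range(5)]
    return statistics.mean(per_pv_means), statistics.pstdev(per_pv_means)

for country in ("A", "B"):
    mean, spread = country_mean(students, country)
    print(country, round(mean, 1), round(spread, 1))
```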

Official interpretation: competence levels

The official publications attach great importance to the qualitative interpretation of the quantitative results with the help of so-called competence levels. This is necessary because the scores by themselves say nothing about content: we do not know, for example, how many (and which) tasks a Finnish student has solved compared with a German student. These competence levels are based on an a priori characterization of the tasks and on the solution frequencies measured during the test. In mathematics didactics, a heated argument has broken out over whether such a construction is even possible. The line of argument is that the different ways of solving the tasks make it impossible to assign a content-related difficulty to a task unambiguously, so the competence levels cannot be constructed in terms of content (compare e.g. Journal für Mathematik-Didaktik, issues 3/4-2004, 1-2005 and 3/4-2005).

References

  1. PISA 2000
  2. PISA 2003 and 2006 (memento of June 17, 2007 in the Internet Archive)
  3. BMBF press release 182/2010 of October 14, 2010 (memento of October 26, 2010 in the Internet Archive)
  4. Center for International Comparative Educational Studies (ZIB). Kultusministerkonferenz, January 17, 2017, accessed on November 12, 2017.
  5. (also technical report)
  6. [1]
  7. Archived copy (memento of June 13, 2007 in the Internet Archive)
  8. [2]
  9. PISA 2000 Technical Report (English). OECD. Archived from the original on July 15, 2009. Retrieved September 9, 2019.