Experts classified each of the 20 selected items according to these categories. It was hypothesized that items classified as covering higher levels of proficiency should exhibit higher indices of item difficulty in the pilot test. However, as the empirical difficulty of a test item is shaped by a multitude of factors that can only partially be controlled for (e.g. the amount of prior knowledge required, the clarity of the item text, the mental workload involved), a certain amount of overlap between the pre-defined sets of items is inevitable. Previous work on proficiency scaling (cf. Watermann and Klieme, 2002) shows that sophisticated theories of item difficulty, operationalized by expert ratings of item demands, can explain between 65 and 80 percent of the between-item variance in item difficulty when applied to large-scale assessment data.

The following two criteria are therefore realistic and should yield a satisfactory level of precision:

  1. Mean item difficulty should increase from level 1 to level 4.
  2. At least two thirds of the between-item variance in difficulty should be explained by the experts' classification of items into the four proficiency levels.
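The two criteria can be checked directly once pilot difficulties are available. The following is a minimal sketch: the level assignments mirror the classification reported below (3, 3, 8, and 6 items), but the difficulty values are purely illustrative, not the study's data. Criterion 2 is operationalized here as eta-squared from a one-way decomposition of the difficulty variance by level.

```python
# Hypothetical pilot data: level assignment per item (counts 3/3/8/6 as in
# the classification below) and illustrative item difficulties (logits).
levels = [1] * 3 + [2] * 3 + [3] * 8 + [4] * 6
difficulty = [-1.9, -1.5, -1.2,                            # level 1
              -0.8, -0.6, -0.3,                            # level 2
              -0.1, 0.0, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8,    # level 3
              1.0, 1.2, 1.4, 1.6, 1.8, 2.1]               # level 4

# Criterion 1: mean difficulty should increase from level 1 to level 4.
means = {lv: sum(d for l, d in zip(levels, difficulty) if l == lv)
             / levels.count(lv)
         for lv in (1, 2, 3, 4)}
assert all(means[lv] < means[lv + 1] for lv in (1, 2, 3))

# Criterion 2: at least two thirds of the between-item variance in
# difficulty should be explained by the classification (eta-squared =
# between-level sum of squares / total sum of squares).
grand = sum(difficulty) / len(difficulty)
ss_total = sum((d - grand) ** 2 for d in difficulty)
ss_between = sum(levels.count(lv) * (means[lv] - grand) ** 2
                 for lv in (1, 2, 3, 4))
eta_squared = ss_between / ss_total
assert eta_squared >= 2 / 3
```

With real pilot data, the same decomposition is what a one-way ANOVA of item difficulty on level membership would report; eta-squared is then directly comparable to the 65-80 percent range cited above.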

5.5.2 Operationalizing the proficiency levels: The item classification

Level 1: 3 out of 20 items were classified as content-related tasks. These are rather concrete tasks with a limited scope of reasoning. They require the respondent to make simple connections, without having to check the constraints systematically. The respondent has to draw direct consequences, based on the information given and on her previous, content-related knowledge.

Thus, the mental operations that must be applied successfully to solve items at level 1 can be characterized as schemata of content-related thinking.

Level 2: Another 3 items were classified as corresponding to the second level. These items require the respondent to evaluate certain alternatives with regard to well-defined, transparent, explicitly stated criteria. The reasoning may be done step by step, in a linear process, combining information from the question section and the information section.

Thus, the mental operations that must be applied successfully to solve items at level 2 can be characterized as systematic (concrete-logical) reasoning.

Level 3: 8 out of 20 items were classified as belonging to this level. Some tasks require the respondent to order several objects according to given criteria. Others require her to determine a sequence of actions/events or to construct a solution by taking non-transparent or multiple interdependent constraints into account. This means that at level 3 the respondent has to cope with multi-dimensional or ill-defined goals.

Thus, the mental operations that must be applied successfully to solve items at level 3 can be characterized as formal operations. The reasoning process goes back and forth in a non-linear manner, requiring a good deal of self-regulation.

Level 4: The remaining 6 of the 20 items correspond to this level. These items require the respondent to judge the completeness, consistency and/or dependency among multiple criteria. In many cases, she has to explain how the solution was reached and why it is correct. The respondent has to reason from a "meta-perspective", grasping an entire system of problem states and possible solutions.

Thus, the mental operations that must be applied successfully to solve items at level 4 can be characterized as critical thinking and meta-cognition.