With a coefficient alpha of .81, the internal consistency of the integrated short versions exceeded that of every long version. The best alpha coefficient found in the pilot study for the long version of a single project was .76. However, each individual long version contained fewer items than the four short versions together. Estimating the alpha that would have resulted from a single 20-item project leads to a coefficient of .90. Thus, integrating short versions of several projects results in a somewhat lower consistency than a test made up of the same number of items from a single project. However, the difference is modest, and an alpha of .81 is generally considered acceptable.
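One standard way to estimate the reliability of a lengthened test is the Spearman-Brown prophecy formula. The text does not say which method was actually used, so the following is only a minimal sketch under that assumption. The function name `spearman_brown` and the 7-item starting length in the example are illustrative choices, not figures taken from the study.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after multiplying test length by `length_factor`.

    Spearman-Brown prophecy formula: r_new = k * r / (1 + (k - 1) * r),
    where r is the current reliability and k the lengthening factor.
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    k = length_factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Illustration only: the 7-item starting length is an ASSUMPTION,
# not a number reported in the study.
assumed_items = 7
predicted = spearman_brown(0.76, 20 / assumed_items)
print(round(predicted, 2))  # prints 0.9
```

Under this assumed starting length, an alpha of .76 projected to 20 items comes out close to the .90 quoted above. A different starting length would give a different projection.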

From these findings it was concluded that the problem-solving test for the main survey should be based on the combined short versions of all projects. The short versions, however, covered a somewhat narrower range of item difficulties than the long versions. Therefore, it was decided to add some of the items from the long versions to the short-version booklet for the final item selection, as described in the next section.

5.4 The final item selection

The final item selection was based on a number of criteria. First, the selected items were required to have good psychometric properties. Three parameters were considered: difficulty, discrimination, and item fit within the IRT model. Items with extreme difficulty, weak discrimination, and/or inadequate fit to the IRT model were eliminated. Second, it was crucial that the final set of items measured all proficiency levels of problem solving. The overall range of difficulties was therefore to be as large as possible, with the difficulties distributed as evenly as possible across that range. Third, the selected items were not supposed to show critical differences between national samples, unless these differences could be traced to an operational error in the affected countries' materials or procedures. Some more technical aspects were also taken into account. An attempt was made to keep the item types and question formats as balanced as possible. Time constraints were examined and adhered to, linguistic properties of the items were checked, and scoring problems were resolved. It was also checked that the resulting projects still "made sense" from the point of view of each project's story or context.

Based on these criteria, 20 items were finally selected for the main study.
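The psychometric part of these criteria (difficulty, discrimination, and item fit) can be sketched as a simple screening step. This is only an illustration: the `Item` fields, the choice of fit statistic, and all three cut-off values are assumptions, since the text reports no numeric thresholds. The cross-national, content, and format criteria required judgment and are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    difficulty: float      # IRT difficulty in logits (assumed scale)
    discrimination: float  # e.g. item-total correlation (assumed index)
    infit: float           # weighted mean-square fit (assumed statistic)


# Hypothetical cut-offs; the study does not report its thresholds.
MAX_ABS_DIFFICULTY = 3.0
MIN_DISCRIMINATION = 0.25
INFIT_RANGE = (0.8, 1.2)


def passes_screening(item: Item) -> bool:
    """Return True if the item meets all three psychometric criteria."""
    not_extreme = abs(item.difficulty) <= MAX_ABS_DIFFICULTY
    discriminates = item.discrimination >= MIN_DISCRIMINATION
    fits_model = INFIT_RANGE[0] <= item.infit <= INFIT_RANGE[1]
    return not_extreme and discriminates and fits_model


def screen(items: list[Item]) -> list[Item]:
    """Keep only the items that pass the psychometric screening."""
    return [item for item in items if passes_screening(item)]
```

A screening step like this only narrows the candidate pool. Spreading the remaining items evenly across the difficulty range would still be a separate selection decision.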

Figure 4 shows the distribution of item difficulties for the items that were selected (right) and those that were eliminated (left). As the figure shows, the final version of the ALL problem-solving test covers the whole range of difficulties, with the exception of only one very difficult item. Mean difficulty was preserved from the pilot test versions to the final version of the test to be used in the main survey.