Methods from classical test theory as well as advanced models from item response theory (IRT) were applied. In addition, experts classified the items and rated item features. Most of the analyses were carried out in preparation for the final item selection and revision. Therefore, some of the results presented below are based on preliminary data sets and preliminary scoring procedures. However, comparisons carried out after the final test analysis revealed that the results are stable. For example, the correlation between our first version of item difficulty parameters and the final parameters, generated at Educational Testing Service (ETS) after the item selection and scoring had been finalized, is .92.

The results of the pilot test are quite conclusive. In the following, the relevant criteria and analytical procedures for each of the four issues will be described, followed by a short presentation of the corresponding findings from the ALL pilot study. Thus, it will become clear how the pilot results were used to select a final set of instruments and to develop an optimal instrument for the assessment of analytical problem solving.

5.2 A unique, common scale for analytical problem solving

5.2.1 Criteria and expectations

The matrix design of the field trial allowed for an integrated analysis of the long versions of all four projects. Thus, it was possible to estimate the latent (error-free) correlations between the projects. We expected these latent correlations to be around or above .90. Correlations of this size would indicate that the four projects could, in fact, be interpreted as building blocks of a single, common latent dimension.
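The latent correlations reported in this study were estimated from model-based analyses of the matrix design. As a simple illustration of the underlying idea, the classical correction for attenuation estimates an error-free correlation from an observed correlation and the reliabilities of the two scales. The sketch below uses hypothetical values, not the actual pilot data:

```python
def disattenuated_r(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: estimate the latent
    (error-free) correlation between two scales from their observed
    correlation and their reliabilities."""
    return r_xy / (rel_x * rel_y) ** 0.5

# Hypothetical example: observed correlation of .78 between two
# project blocks, each with reliability .85
print(round(disattenuated_r(0.78, 0.85, 0.85), 3))  # -> 0.918
```

With equal reliabilities the correction reduces to dividing by the common reliability, which is why modest observed correlations between blocks are compatible with latent correlations near .90.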

The classical approach to test and item analysis (calculating item-test correlations and estimating test reliability by the so-called coefficient alpha) could be applied to the combined short versions of all four projects (I+J+K+L), as these 18 items were administered to the same group of respondents. According to standards of item construction, an alpha coefficient above .80 would indicate that all the items, regardless of the different project contexts, make up a single, consistent dimension.
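The two classical statistics used here can be computed directly from an items-by-respondents score matrix. The following sketch shows coefficient alpha and the part-whole corrected item-test correlation (each item correlated with the sum of the remaining items) on a small hypothetical set of 0/1-scored responses, not the actual ALL data:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def corrected_item_total(scores):
    """Part-whole corrected item-test correlations: each item
    against the total score of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])

# Hypothetical 0/1 scored responses: 6 respondents x 4 items
x = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(x), 3))       # -> 0.667
print(np.round(corrected_item_total(x), 2))
```

The part-whole correction matters because an item trivially correlates with any total that includes it; removing the item from the total avoids this inflation.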

5.2.2 Findings and conclusions

The calculated pair-wise latent correlations between the different blocks ranged from .925 to .959. The combined short versions show sufficiently high internal consistency (alpha = .81; part-whole corrected item-test correlations ranging from .23 to .55, with a median of .38).

Thus, we can conclude that the items from all four projects form a common latent dimension, i.e. the analytical problem-solving scale. This is true both for the long and the short versions of the problem-solving instrument. This finding is very much in line with results from earlier implementations of the project approach (Ebach, Klieme and Hensgen, 2000; Klieme et al., 2001), where structural equation models (SEM) showed that problem-solving tests based on the project approach make up a unique dimension. This result has important consequences for the validity of the ALL problem-solving scale. It shows that the problem-solving test does not merely measure the ability to cope with certain special, context-dependent planning problems. Instead, the items actually do tap a general competency for analytical reasoning and decision making in complex situations where problem solving is required.