First feasibility study. Of the items generated at this stage, 80 items were tested in a feasibility study conducted in the USA and The Netherlands, on samples of about N=300 per country; each item was answered by about 150 respondents. Results enabled analysis of error patterns and gender bias on items, and assessment of psychometric properties of items in terms of both classical test theory and Item Response Theory (IRT) parameters. Comments made by respondents in focus groups suggested that only a few items were occasionally misunderstood. Statistical analyses showed that most items had adequate psychometric properties and were answered in roughly the same way by males and females, and that the items tested covered a wide range of difficulty levels. (Four items from the QL scale used in IALS were also tested to enable rough calibration of the item difficulty estimates obtained from the feasibility sample, in light of the difficulty levels of these four items in the much larger and nationally representative samples used in IALS.)
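As an illustration of the classical test theory statistics referred to above, the following minimal sketch computes an item's difficulty (proportion correct) and a point-biserial discrimination index. The function names and the tiny data set are hypothetical and are not taken from the feasibility study; the study's actual analyses, including the IRT estimation, are not reproduced here.

```python
from math import sqrt


def item_difficulty(item_scores):
    """Proportion of respondents answering the item correctly (scores are 0/1)."""
    if not item_scores:
        raise ValueError("no responses")
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and a total score.

    Uses r = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean total
    scores of respondents who got the item right and wrong, s is the
    population standard deviation of the total scores, and p = 1 - q is the
    proportion correct.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("score lists must have the same length")
    n = len(item_scores)
    n_correct = sum(item_scores)
    if n_correct == 0 or n_correct == n:
        raise ValueError("item has no variance")
    p = n_correct / n
    mean_correct = sum(t for i, t in zip(item_scores, total_scores) if i == 1) / n_correct
    mean_wrong = sum(t for i, t in zip(item_scores, total_scores) if i == 0) / (n - n_correct)
    mean_total = sum(total_scores) / n
    sd_total = sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    if sd_total == 0:
        raise ValueError("total scores have no variance")
    return (mean_correct - mean_wrong) / sd_total * sqrt(p * (1 - p))


# Hypothetical responses of four respondents to one item, with their total scores.
example_item = [1, 1, 0, 0]
example_totals = [4, 3, 2, 1]
example_p = item_difficulty(example_item)
example_rpb = point_biserial(example_item, example_totals)
```

Comparing proportion correct between males and females in this way gives only a raw difference; a proper analysis of gender bias would also condition on overall ability.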

The feasibility study enabled evaluation of several other important issues that arise during item translation and adaptation to different cultural contexts, e.g., the use of different monetary, length, or volume units. Results showed that performance on most items was comparable across languages even when the items used different units of measurement. Also, responses were scored independently by two scorers, and an analysis of scorer agreement showed adequate scoring reliability, suggesting that the scoring schemes and scoring instructions were well understood by scorers operating in two languages.
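Scorer agreement of the kind described above can be summarized in several ways. The sketch below shows two common indices, simple percent agreement and Cohen's kappa (which corrects for chance agreement). The choice of these indices is an assumption made for illustration; the source does not state which statistic was used, and the scores shown are hypothetical.

```python
from collections import Counter


def percent_agreement(scores_a, scores_b):
    """Share of responses to which both scorers assigned the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(scores_a)
    observed = percent_agreement(scores_a, scores_b)
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    # Agreement expected by chance, from each scorer's marginal distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both scorers used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores (1 = correct, 0 = incorrect) given by two scorers.
scorer_1 = [1, 1, 0, 0]
scorer_2 = [1, 1, 0, 1]
agreement = percent_agreement(scorer_1, scorer_2)  # 0.75
kappa = cohens_kappa(scorer_1, scorer_2)           # 0.5
```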

Further analyses revealed interesting patterns of correlations between respondents' scores on some of the non-cognitive scales and their overall performance on Numeracy items. This suggested that some of the attitudinal and belief items can indeed be used to help in understanding cognitive performance.

Overall, 68 of the 80 items tested appeared to satisfy all selection criteria. For these 68 items, item difficulty (in terms of percent correct on an item) was highly correlated (r = -0.793) with predicted item complexity, as determined during item development based on the factors outlined in Figure 2 and detailed in Appendix 2; the negative sign means that items rated as more complex tended to be answered correctly by fewer respondents. This indicated that the complexity scheme could serve as a useful aid to inform the direction in which features of new items should be varied so as to reach a pre-determined or desired level of difficulty. The detailed scheme included in the appendix is a refinement of the original and continues to represent work in progress. Because of the recursive nature of the testing of this scheme (e.g., the same individuals wrote the scheme and rated the complexity of items), caution should be exercised in further interpretive use of the present version; further refinement and validation work is necessary.
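The relationship between predicted complexity and percent correct can be checked with an ordinary Pearson correlation. The following is a minimal sketch, assuming one complexity rating and one percent-correct value per item; the function name and the five data points are hypothetical and do not reproduce the 68 items or the reported coefficient.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical items: predicted complexity rating and observed percent correct.
predicted_complexity = [1, 2, 3, 4, 5]
percent_correct = [90, 80, 65, 50, 30]

# A negative value indicates that more complex items were answered
# correctly less often.
r = pearson_r(predicted_complexity, percent_correct)
```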

External review. As part of work at this stage, the conceptual framework and some sample items were sent to a panel of 16 experts from 9 countries for review and comment. The reviews highlighted that there is a range of conceptions regarding the terrain covered by the term "numeracy", as expected in light of the conceptual analysis summarized in Part A. Nonetheless, the reviews overall supported the conceptual framework for Numeracy assessment developed for the ALL project, and endorsed the approach to item production described earlier in this section.

7.2 Stage 2 (1999-2000): Additional item production, second feasibility study

Following the successful completion of Stage 1, additional items were contributed by experts from four countries: Austria, the Czech Republic, Hungary, and Sweden, in order to enrich the pool of Numeracy items and increase its cultural diversity. Some of these items were selected and adapted by the Numeracy team in order to fit the item production grid and the item development principles.