7. Production and evaluation of items

The creation of items for the Numeracy assessment progressed through three stages: the first two involved producing items and testing them on relatively small samples in two countries; the third involved a much larger Pilot testing process.

7.1 Stage 1 (1998-1999): Production and field-testing of a first item pool

Drawing on the above general principles, team members generated a pool of over 80 items, based on their experience in research, assessment, and teaching with both school-based and diverse adult and workplace learner populations in several countries.

Production grid. Items were created to fill cells within an item production grid with four key dimensions that match the conceptual facets outlined in Table 1:

  1. Type of purpose / context: everyday, societal, work, further learning.
  2. Type of response: identifying or locating, acting upon (order/sort, count, estimate, compute, measure, model), interpreting, communicating about.
  3. Type of mathematical or statistical information: quantity, dimension, patterns/relations, data/chance, change. The content of the tasks was also conceived, however, in terms of common school-based mathematics topics more familiar to policy makers and educators, i.e., whole numbers and basic operations; ratios, percents, decimals and fractions; measurement; geometry; algebra; and statistics.
  4. Type of representation of mathematical or statistical information: numbers, formulae, pictures, diagrams, graphs, tables, texts.
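The grid can be pictured as the Cartesian product of the four dimensions, with each cell representing one combination that items should collectively cover. A minimal sketch (the dimension values are taken from the list above; the variable names and the idea of enumerating cells in code are illustrative, not part of the original design process):

```python
from itertools import product

# The four item-production dimensions described above (labels abbreviated).
PURPOSES = ["everyday", "societal", "work", "further learning"]
RESPONSES = ["identify/locate", "act upon", "interpret", "communicate about"]
CONTENT = ["quantity", "dimension", "patterns/relations", "data/chance", "change"]
REPRESENTATIONS = ["numbers", "formulae", "pictures", "diagrams",
                   "graphs", "tables", "texts"]

# Each grid cell is one combination of the four dimensions.
grid = list(product(PURPOSES, RESPONSES, CONTENT, REPRESENTATIONS))
print(len(grid))  # 4 * 4 * 5 * 7 = 560 cells
```

In practice a pool of 80 items cannot fill every cell of such a grid, which is why the grid serves as a coverage map guiding item writing rather than a quota.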

Scoring. Guidelines for scoring responses were designed to classify them into three general groups: "correct", "any other response" (i.e., wrong answers), and "not attempted" (i.e., no indication the respondent tried an item). However, for many items, multiple codes were prepared to capture different types of "correct" or "wrong" answers, enabling an analysis of error patterns and shedding light on the extent to which instructions are understood and items elicit the expected type of responses. For some items requiring estimation or measurement, multiple codes were prepared to capture responses that differ in accuracy yet still fall within a "correct" or "wrong" region, in order to understand the level of accuracy that respondents adopt.
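As an illustration only, the multi-code scheme for an estimation item might look like the sketch below. The actual coding schemes are item-specific; every code value, tolerance threshold, and the target answer here are hypothetical:

```python
def score_estimate(response, target=240.0):
    """Hypothetical scoring of an estimation item whose exact answer is `target`.

    Returns a code that distinguishes degrees of accuracy within the broad
    "correct" / "any other response" / "not attempted" grouping.
    """
    if response is None:                 # no indication of an attempt
        return "9: not attempted"
    error = abs(response - target) / target
    if error == 0:
        return "1: exact"
    if error <= 0.05:                    # within 5% of the target: still "correct"
        return "2: correct within tolerance"
    if error <= 0.20:                    # a near miss: wrong, but informative
        return "7: wrong, near miss"
    return "8: any other response"

print(score_estimate(240.0))   # → 1: exact
print(score_estimate(250.0))   # → 2: correct within tolerance
print(score_estimate(None))    # → 9: not attempted
```

Coding responses this way preserves the three-way grouping for scaling while retaining the finer-grained information needed for error-pattern analysis.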

Non-cognitive items. Research literature suggests that the way in which a person responds to a numeracy task, including overt actions as well as internal thought processes and the adoption of a critical stance, depends not only on knowledge and skills but also on attitudes towards mathematics, beliefs about one's mathematical skills, habits of mind, and prior experiences involving tasks with mathematical content (Cockcroft, 1982; Lave, 1988; Schliemann and Acioly, 1989; Saxe, 1991; McLeod, 1992; Gal, 2000). Hence, the Numeracy team also prepared several scales for the Background Questionnaire, with questions designed to measure numeracy practices at home and at work, attitudes and beliefs about mathematics, and information about the environment in which the respondent learned mathematics while in school. Such scales may help in explaining performance on numeracy tasks, as well as in understanding respondents' status on variables of interest to policy makers, such as participation in further learning or employment status.