10. Results

10.1 Items

With the ultimate goal of selecting 40 items for the Main Survey, the Numeracy team first examined the Pilot data to identify items that might be problematic in terms of psychometric anomalies or scoring unreliability. The key patterns and data examined were the following (a code sketch of these screening checks appears after the list):

  • Wide variation in performance on an item across countries or gender groups could mean that the item's context was not universal, or it could reveal discrepancies in either the adaptation of items or test administration.
  • A significant deviation between the observed performance on an item (for example, "percent correct" in classical test theory terms) and the difficulty level predicted by the theoretical complexity scheme (see Part A) could indicate misunderstanding of the question or the presence of unexpected factors that cause response errors in some but not all countries.
  • A large number of disagreements between scorers on an item could indicate that its scoring rubric was not discriminating properly. In addition, anecdotal reports from scorers on the listserv were used to flag a few items whose scoring rubrics were difficult to apply for various reasons.

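The checks above can be expressed as simple screening statistics. The sketch below is illustrative only: the table layout, column names (item_id, country, score, rescore), and flagging thresholds are assumptions, not the actual Pilot data structures or criteria.

```python
import pandas as pd

# Hypothetical long-format scoring data: one row per scored response.
# "rescore" holds an independent second scorer's 0/1 judgment.
responses = pd.DataFrame({
    "item_id": ["N01", "N01", "N01", "N01", "N02", "N02", "N02", "N02"],
    "country": ["A",   "A",   "B",   "B",   "A",   "A",   "B",   "B"],
    "score":   [1,     1,     0,     0,     1,     0,     1,     1],
    "rescore": [1,     0,     0,     0,     1,     0,     1,     1],
})

# Check 1: cross-country variation in percent correct per item.
pct_by_country = responses.pivot_table(
    index="item_id", columns="country", values="score", aggfunc="mean"
)
country_range = pct_by_country.max(axis=1) - pct_by_country.min(axis=1)
wide_variation = country_range[country_range > 0.30]  # illustrative threshold

# Check 3: scorer disagreement rate per item.
disagree = responses["score"] != responses["rescore"]
disagreement_rate = disagree.groupby(responses["item_id"]).mean()
flagged_scoring = disagreement_rate[disagreement_rate > 0.10]  # illustrative

print("Wide cross-country variation:", list(wide_variation.index))
print("High scorer disagreement:", list(flagged_scoring.index))
```

(Check 2, comparing observed percent correct with predicted difficulty, is sketched in section 10.2 below.)
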
Based on such considerations, the team rejected very few of the 81 items, as gross problems had mostly been eliminated by the two feasibility studies and the translation and adaptation process. The few problems revealed with specific items were addressed through technical recommendations. For example, instructions to use the ruler on certain items were made more explicit for both the respondent and the examiner. Recommendations were also made on how to standardize the production of stimuli that required respondents to measure a length but that had not been printed to exactly the same dimensions in different countries. In addition, the complex scoring rubrics developed for error analysis and used through the Pilot study were collapsed into a simpler correct/incorrect classification. This approach was chosen because it simplified scoring processes and scorer decisions, and because analyses showed no significant loss in the information provided about respondents' skills for the purposes of this survey.
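
As a minimal illustration of the collapsed scoring, the mapping below reduces a polytomous error-analysis rubric to a dichotomous score. The specific code values are hypothetical; the actual Pilot rubrics are not reproduced here.

```python
# Hypothetical detailed rubric codes mapped to a 0/1 score.
DETAILED_TO_DICHOTOMOUS = {
    "fully_correct": 1,
    "correct_alternate_method": 1,
    "computation_error": 0,
    "wrong_operation": 0,
    "misread_stimulus": 0,
    "omitted": 0,
}

def dichotomize(detailed_code: str) -> int:
    """Collapse a detailed rubric code to correct (1) / incorrect (0)."""
    return DETAILED_TO_DICHOTOMOUS.get(detailed_code, 0)

print(dichotomize("correct_alternate_method"))  # -> 1
```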

10.2 Scale

The Pilot results reaffirmed that the Numeracy items as a whole constitute a cohesive scale. Across all the countries, Cronbach's alpha coefficient, a measure of the internal consistency of a group of items intended to represent an underlying construct, averaged 0.88 across the four blocks of Numeracy items. A result above 0.80 generally indicates an underlying consistency in a scale designed to measure cognitive skills. An average alpha of 0.88 is especially informative, considering that the Numeracy items encompass a range of content facets and vary in terms of their contexts, literacy demands and response requirements.
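
For reference, the alpha statistic reported above can be computed as follows. This is a self-contained sketch on simulated 0/1 scores, not the Pilot data; the block size (500 respondents by 20 items) is an arbitrary assumption.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of item scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated dichotomous responses driven by a shared latent ability,
# so the items cohere the way a real cognitive scale would.
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
demo = ((ability + 0.8 * rng.normal(size=(500, 20))) > 0).astype(int)
print(f"alpha = {cronbach_alpha(demo):.2f}")
```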

Additionally, the construct as defined was validated by the strong correlation between the predicted and observed difficulty of the items in the Pilot. With a wider and more diverse population than the feasibility study, the Pilot study's empirical results (percent correct) were highly correlated (r = -0.799) with the theoretical predictions of difficulty determined with the complexity scheme (Figure 2 and Appendix 2); the correlation is negative because percent correct falls as predicted difficulty rises. Again, the reader is advised that further validation and refinement of the detailed scheme are planned.
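
This predicted-versus-observed check (Check 2 in section 10.1) amounts to a Pearson correlation across items. The sketch below uses placeholder values, not the actual 81-item complexity ratings or Pilot percent-correct figures.

```python
import numpy as np

# Hypothetical per-item values: theoretical complexity ratings and the
# observed proportion of respondents answering each item correctly.
predicted_complexity = np.array([3, 5, 8, 2, 7, 6, 4, 9])
observed_pct_correct = np.array([0.82, 0.61, 0.33, 0.90,
                                 0.41, 0.48, 0.70, 0.25])

r = np.corrcoef(predicted_complexity, observed_pct_correct)[0, 1]
print(f"r = {r:.3f}")  # negative: harder items yield lower percent correct
```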