10. Results

10.1 Items

With the ultimate goal of selecting 40 items for the Main Survey, the Numeracy team first examined the Pilot data to identify items that might be problematic in terms of psychometric anomalies or scoring unreliability. Several key patterns in the item-level data were examined in this review.
Based on such considerations, the team rejected very few of the 81 items, as gross problems had mostly been eliminated through the two feasibility studies and the translation and adaptation process. A few problems with specific items were revealed and addressed by technical recommendations. For example, instructions to use the ruler on certain items were made more explicit for both the respondent and the examiner. Recommendations were also made on how to standardize the production of stimuli that required respondents to measure a length but that had not been printed to the exact same dimensions in all countries. In addition, the complex scoring rubrics developed and used up to the Pilot study for error analysis were collapsed into a simpler correct/incorrect classification. This approach was chosen because it simplifies scoring processes and scorer decisions, and because analyses showed no significant loss in the information provided about respondents' skills for the purposes of this survey.

10.2 Scale

The Pilot results reaffirmed that the Numeracy items as a whole constitute a cohesive scale. Cronbach's Alpha, a measure of the internal consistency of a group of items intended to represent an underlying construct, averaged 0.88 across the four blocks of Numeracy items over all participating countries. A result above 0.80 generally indicates an underlying consistency in a scale designed to measure cognitive skills. An average Alpha of 0.88 is especially informative given that the Numeracy items encompass a range of content facets and vary in their contexts, literacy demands and response requirements. The construct as defined was further validated by the strong correlation between the predicted and observed difficulty of the items in the Pilot. With a wider and more diverse population than the feasibility study, the Pilot's empirical results (percent correct) were highly correlated (r = -0.799) with the theoretical predictions of difficulty derived from the complexity scheme (Figure 2 and Appendix 2). Again, the reader is advised that further validation and refinement of the detailed scheme is planned.
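For readers unfamiliar with the two statistics cited above, the sketch below shows how Cronbach's Alpha and the difficulty correlation are typically computed from a respondents-by-items matrix of correct/incorrect scores. The data here are entirely synthetic and the cronbach_alpha function is a generic illustration, not the survey's actual scoring or analysis code; the printed values will not match the reported 0.88 and -0.799.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's Alpha for a respondents-by-items score matrix.

    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total score))
    """
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative only: simulated binary responses for 500 hypothetical
# respondents on a 20-item block, generated from a simple logistic model.
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
predicted_difficulty = np.linspace(-1.5, 1.5, 20)
p_correct = 1 / (1 + np.exp(-(ability - predicted_difficulty)))
responses = (rng.random((500, 20)) < p_correct).astype(int)

alpha = cronbach_alpha(responses)

# Correlation between predicted item difficulty and observed percent correct,
# analogous in form to the r = -0.799 reported for the Pilot.
percent_correct = responses.mean(axis=0)
r = np.corrcoef(predicted_difficulty, percent_correct)[0, 1]

print(f"Cronbach's Alpha: {alpha:.2f}; r(predicted difficulty, % correct): {r:.2f}")

The negative sign of r is expected: items predicted to be harder should show a lower observed percent correct, which is what the survey's reported correlation reflects.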