4. Criteria for inclusion of assessments in ALL

Once the domains in which the ALL study would attempt to develop assessments had been identified, strict criteria were established for including a domain in the final international comparative assessment.

First, in keeping with the initial selection criteria, skill domains carried in the international assessment had to be related to key health, educational, social or economic outcomes. At this stage each additional domain also had to explain at least a further 10% of the variance in at least one of these outcomes.
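
The 10% requirement can be read as a threshold on incremental variance explained. The sketch below is an illustration under that assumption, not the procedure used in ALL: it compares the R-squared of an outcome regressed on an existing domain score with the R-squared obtained when a candidate domain is added. The function names and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared_two(y, x1, x2):
    """R-squared of y on two predictors, from pairwise correlations."""
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)

def incremental_r2(outcome, existing_domain, candidate_domain):
    """Extra share of outcome variance explained by the candidate domain."""
    base = pearson(outcome, existing_domain) ** 2
    return r_squared_two(outcome, existing_domain, candidate_domain) - base

# Toy data (hypothetical): two uncorrelated domain scores, outcome is their sum.
existing = [1, -1, 1, -1]
candidate = [1, 1, -1, -1]
outcome = [2, 0, 0, -2]

gain = incremental_r2(outcome, existing, candidate)
meets_criterion = gain >= 0.10  # threshold taken from the text
```

With more than two predictors the same comparison would be made from full regression fits rather than from pairwise correlations.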

Second, the theory in each domain had to identify a set of variables thought to underlie the relative difficulty of tasks in the domain. A priori, this set had to explain most of the variation in item difficulty over the intended range of assessment described by the framework.

Third, empirical results had to demonstrate a high degree of agreement between item difficulties predicted from theory and those estimated empirically from pilot data. For the ALL study the agreement rate had to exceed 80% at the population level.
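
The text does not define how the agreement rate was computed. One plausible reading, assumed here purely for illustration, is the share of variance in empirically estimated item difficulties that is accounted for by the theory-based predictions, that is, their squared correlation. The difficulty values below are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def difficulty_agreement(predicted, empirical):
    """Squared correlation between predicted and estimated difficulties."""
    return pearson(predicted, empirical) ** 2

# Hypothetical item difficulties on a common reporting scale.
predicted = [200, 250, 300, 350, 400]  # from the domain theory
empirical = [210, 240, 310, 340, 405]  # estimated from pilot data

agreement = difficulty_agreement(predicted, empirical)
passes = agreement > 0.80  # threshold taken from the text
```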

Fourth, empirical pilot results had to demonstrate that items were working in a psychometrically stable and equivalent manner for population sub-groups within countries and between countries.

Inter-country comparisons of percent correct, omit rates, not-reached rates, biserial correlations and item response theory (IRT) parameters were examined to determine whether they conformed to expected patterns. The mean deviations and root mean square deviations of item characteristic curves were computed to ensure that items were functioning within an empirically defined tolerance both within and between countries.
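
The two deviation statistics can be sketched as follows. This is a minimal illustration that assumes a two-parameter logistic item characteristic curve and an evenly spaced, unweighted proficiency grid; in practice the differences are often weighted by the proficiency distribution. The parameter values are hypothetical and the text gives no tolerance.

```python
from math import exp, sqrt

def icc(theta, a, b):
    """Two-parameter logistic curve: probability of a correct response."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def icc_deviations(common, country, grid):
    """Mean deviation and root mean square deviation between two curves.

    `common` and `country` are (a, b) tuples: slope and difficulty.
    """
    diffs = [icc(t, *country) - icc(t, *common) for t in grid]
    md = sum(diffs) / len(diffs)
    rmsd = sqrt(sum(d * d for d in diffs) / len(diffs))
    return md, rmsd

# Proficiency grid from -3 to 3 in steps of 0.1 (an assumption).
grid = [-3 + 0.1 * i for i in range(61)]

# Hypothetical item that is somewhat harder in one country (b = 0.3)
# than under the common international parameters (b = 0.0).
md, rmsd = icc_deviations((1.0, 0.0), (1.0, 0.3), grid)
```

A negative mean deviation indicates that the item is harder in the country than the common curve predicts, while the root mean square deviation also picks up differences in slope that cancel out in the mean.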

Fifth, open-ended items had to be scoreable with very high reliability (97% or better inter-rater agreement within countries and 90% or better inter-rater agreement between countries) to ensure comparability.
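
As an illustration, inter-rater agreement can be computed as the percentage of responses to which two scorers independently assign the same score. The sketch below assumes exact agreement on item-level scores; the scores are invented and the names are hypothetical.

```python
WITHIN_COUNTRY_MIN = 97.0   # percent, from the text
BETWEEN_COUNTRY_MIN = 90.0  # percent, from the text

def percent_agreement(first_scores, second_scores):
    """Percentage of responses given identical scores by two scorers."""
    if not first_scores or len(first_scores) != len(second_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(first_scores, second_scores) if a == b)
    return 100.0 * matches / len(first_scores)

# Hypothetical rescore of 100 responses within one country:
# the second scorer disagrees on two of them.
original = [1] * 60 + [0] * 40
rescore = list(original)
rescore[0] = 0
rescore[99] = 1

within = percent_agreement(original, rescore)
within_ok = within >= WITHIN_COUNTRY_MIN
```

The between-country check would apply the same function to a common set of responses scored by scorers from different countries, compared against BETWEEN_COUNTRY_MIN.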

Sixth, estimates of the internal consistency of the test had to display an average biserial correlation in the range of 0.60, a level that assures that items are reliably measuring one and the same underlying dimension.
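
For reference, the biserial correlation between a right/wrong item and a continuous criterion such as the total score can be sketched with the textbook formula r = ((M1 - M0) / s) * (p * q / y), where M1 and M0 are the mean criterion scores of those who answered correctly and incorrectly, s is the standard deviation of the criterion, p and q are the proportions correct and incorrect, and y is the standard normal density at the point that splits the distribution into p and q. The use of the population standard deviation and the toy data are assumptions of this sketch.

```python
from statistics import NormalDist, mean, pstdev

def biserial(item, criterion):
    """Biserial correlation of a 0/1 item with a continuous criterion."""
    correct = [c for i, c in zip(item, criterion) if i == 1]
    incorrect = [c for i, c in zip(item, criterion) if i == 0]
    if not correct or not incorrect:
        raise ValueError("item must have both correct and incorrect responses")
    p = len(correct) / len(item)
    q = 1.0 - p
    # Ordinate of the standard normal density at the p/q split point.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(correct) - mean(incorrect)) / pstdev(criterion) * p * q / y

# Hypothetical responses of eight people to one item, with their total scores.
item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [12, 5, 9, 14, 8, 10, 7, 6]

r_bis = biserial(item, total)
```

Averaging this statistic over all items in a domain gives the quantity that the criterion compares with 0.60.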

Seventh, assessment items had to be short enough for each respondent to attempt multiple items, a prerequisite for good statistical coverage of the construct and its covariance with background characteristics.

Eighth, assessment items had to cover the range of proficiency demonstrated by 95% of the target populations, thus assuring that there is neither a ceiling nor a floor effect.

Ninth, items had to discriminate proficiency over the range of difficulty and proficiency displayed by the bulk of the population. In addition, items needed to display good psychometric properties, particularly reasonably steep slopes and stable fit across languages and cultures.

Tenth, assessment items had to be culturally diverse, representing a broad range of cultures, languages and geographic regions.