Introduction

To what extent are international rankings of literacy achievement across countries and regions accurate? Given the centrality of basic reading and writing skills for higher-level learning, such international rankings appear to be critical as global markers of educational quality. Indeed, most governments prioritize high levels of literacy. Presumably, part of their willingness to participate in international literacy surveys reflects their desire to understand and track the literacy skills of their populations. After all, with the rapid rise of international literacy assessments and their influential role in education policy, the reading proficiency of different countries is promoted as an objective proxy for the effectiveness of their education systems. In this review, we consider the question of the validity of international literacy measures, using the example of PISA, through the lens of dyslexia.

“Thinking about validity can be frustrating, and trying to do something about validity can be even more frustrating,” Kane (1992, p. 230) candidly stated. Trying to evaluate validity arguments of an international ranking on PISA reading and trying to say something about dyslexia can be equally frustrating. The former involves reading achievement of adolescents indexed by countries or economies, and the latter centers on word reading difficulties. International large-scale assessments can be a useful avenue for unpacking research on reading impairments across languages. In most educational assessments, the test consumer focuses on the interpretation and uses of scores, while the test developer emphasises examining both the instrument and the theory that informed its construction. Literacy researchers with the goal of educational equity have an obligation to focus on both of these roles.

Definitions of dyslexia vary widely across countries (e.g., McBride, 2019), and reading skills are also associated with economic factors at the country, school, and family levels (e.g., Chiu & McBride-Chang, 2006, 2010). Thus, we must scrutinize the extent to which international literacy comparisons are valid given the myriad differences across countries and territories. We offer suggestions for understanding the complexity of reading scores, including those of the poorest readers.

According to the International Dyslexia Association (2016), the global prevalence rate of dyslexia is approximately 15–20%. Estimates of dyslexia prevalence in alphabetic writing systems range from as low as 2% (Fluss et al., 2008; Miles et al., 1998), to an average of 12–15%, and as high as 17.5% (Peterson & Pennington, 2012; Shaywitz, 1998) and 19.90% (Jiménez et al., 2011; Prior et al., 1995). One of the earliest reviews of dyslexia in non-alphabetic scripts examined Chinese and Japanese readers (Stevenson et al., 1982). Since then, interest in dyslexia research in Asia has gradually increased. In Chinese-speaking school children, for example, a dyslexia prevalence rate of 3.0–12.6% (Gu et al., 2018) has been reported, with one estimate of 9.7% in Hong Kong (Chan et al., 2007). Globally, the rate of dyslexia is about 9.7% (Sharma & Sagar, 2017). Yang et al. (2022) analysed published results from the 1950s to June 2021 and estimated an average rate of developmental dyslexia of 7.10% (7.26% in alphabetic scripts and 6.97% in logographic scripts). Although studies on the prevalence of dyslexia are common, it is important to note that definitions and identification of dyslexia may vary greatly across countries and regions, potentially rendering comparisons of prevalence rates somewhat meaningless (see McBride, 2019).

Since Elley (1992) asked How in the World Do Students Read?, the Programme for International Student Assessment (PISA) reading literacy data have become ubiquitous as a default reference for how well students are reading and where they stand in comparison to their peers from other countries. We focus here on PISA results because they are among the best-known literacy comparisons. Given the tradition of such global literacy indices, to what extent are they comparable? If prevalence rates of dyslexia across countries and regions are difficult or impossible to compare, how can rankings of reading ability be made meaningfully?

The aim of PISA reading literacy and the validity argument

The major goal of PISA is to assess whether adolescents, at an average age of 15 years and after at least 6 years of formal compulsory schooling, can apply what they have learned in school to real-life situations (OECD, 2019a). What are the OECD arguments for the comparability of PISA across cultures? First, PISA reading adopts a broad literacy approach (Hopfenbeck et al., 2018); hence, the content is independent of the curricula mandated by a specific school board or government. Second, testing conditions are kept uniform through standard training that all test administrators across countries undergo. Third, test items are adapted and translated to ensure linguistic equivalence (McQueen & Mendelovits, 2003).

Despite all of this, measurement invariance remains a problem for valid cross-lingual comparisons (Huang et al., 2016; Padilla & Benítez, 2014). What are some of these difficulties? Measurement invariance, or equivalence, assumes that the psychometric properties of PISA are equal (i.e., invariant or equivalent) across groups (i.e., countries, economies, languages, scripts). Establishing equivalence answers the question of whether the reported rankings based on observed score differences are due to (i) actual differences in reading ability or in the efficacy of school systems or (ii) differences in how the test measured reading across contexts and languages. Without measurement equivalence, group comparisons can lead to incorrect conclusions and inferences (Chen, 2007). Addressing this issue is essential for a comprehensive and culturally valid framework of literacy, spanning from the basic process of decoding to the more complex processes of comprehension and inference making.

Recognition of the relationship between literacy achievement and language development warrants the examination of measurement invariance (Asil & Brown, 2016; Oliveri & von Davier, 2011) and of the ecological validity of the measures (Papadopoulos et al., 2021). Measurement equivalence includes construct equivalence, test equivalence, and equivalence of testing conditions (Ercikan & Lyons-Thomas, 2013). Thus, there is a need for language-specific consideration in item construction and selection to ensure comparability. Item-level equivalence has been examined in similar international assessments, such as the Trends in International Mathematics and Science Study (TIMSS) (Ercikan & Koh, 2005), and in comparisons of science items between the National Assessment of Educational Progress (of the United States) and PISA (Stephens & Coleman, 2007). In the traditional factor-analytic approach used in such assessments, the goal of the first level of invariance (configural invariance) is to establish that the construct, reading proficiency, is indicated by the same items with the same factor structure across groups. However, item-level and scale-level comparability and consistency are difficult to attain for two main reasons. First, such a configuration requires a compelling theoretical justification that reading ability, its processes, and its experiences are nearly equivalent across countries and languages. Yet this assertion of a universal model of reading remains questionable (Plaut, 2012). Second, evidence of such configural invariance is best generated with multi-method research (Luong & Flake, 2022), which is rarely the norm. If these fundamental measurement issues remain unaddressed, claims of cross-context literacy achievement or underachievement are less compelling scientifically and less informative for classroom instruction.
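To make this hierarchy concrete, the levels of invariance can be stated in terms of a schematic multi-group factor model (a standard textbook formulation, not PISA's operational model):

$$x_{ijg} = \nu_{jg} + \lambda_{jg}\,\theta_{ig} + \varepsilon_{ijg},$$

where $x_{ijg}$ is person $i$'s response to item $j$ in group $g$ (country, economy, or language), $\theta_{ig}$ is latent reading proficiency, $\lambda_{jg}$ is the item loading, and $\nu_{jg}$ the item intercept. Configural invariance requires only that the same items indicate the construct in every group; metric invariance additionally requires $\lambda_{jg} = \lambda_{j}$ for all $g$; scalar invariance further requires $\nu_{jg} = \nu_{j}$. Only under (at least partial) scalar invariance can group differences in latent means, and hence country rankings, be interpreted as differences in reading proficiency rather than in measurement.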

There is still major disagreement about the sources of validity evidence required to justify inferences from and uses of test scores (Cizek et al., 2008). This may be because intended score meaning and intended test uses (Cizek, 2012) are incompatible as definitions of validity. To begin with, there are multiple perspectives on validity, suggesting that researchers’ understanding and application of validity are not standardized, perhaps impeding generalizability. The foundational psychometric framework of construct validity proposed by Meehl and colleagues was formally incorporated into the Technical Recommendations of the American Psychological Association (1954). According to Cronbach and Meehl (1955), construct validity is to be treated as an evaluation program with an explicit theoretical definition of the construct reflected in the measures developed and used. This shifted the original emphasis of construct validity on internal consistency (Thurstone, 1952) to validity in terms of interpretations (Cronbach, 1971). The obligations of test developers and users are taken to be inherent in this process (Cronbach, 1988). Messick (1989) proposed a unified view of construct validity with an evidential basis of justified interpretations and uses of test scores and a consequential basis of value implications and societal outcomes. Taking forward the validity-as-argument approach of early proponents such as Cronbach, the perspective of Kane (1992, 2016) has, as discussed in the following paragraph, emerged as the most influential in educational assessment, particularly for large-scale measurement (Chapelle et al., 2010). Using PISA scores, we seek to discuss the challenges of test construction and of validating reading as a construct across levels of proficiency and across languages. At the outset, we concede that there is no doubt that the OECD has highly trained measurement experts involved in the sophisticated methodologies of PISA (Berliner, 2020; Takayama, 2018).

Reexamining the construct validity basis of the PISA reading measure

The validity evidence of PISA is purportedly informed by the argument-based approach of Kane (OECD, 2018a). The fundamental aim of Kane’s approach is to put forward different kinds of validity evidence, alternative interpretations, and hypotheses (Kane, 1992) when there are diverse contexts and stakeholders (Kane, 2001). The goal is to accommodate potential changes in the degree of validity in light of new evidence from different contexts, a plausible scenario for international assessments. However, the validation procedure in PISA does not necessarily conform to this intention or to the claims of the argument approach. As noted by Addey et al. (2020), the validity practice of the PISA measure contradicts this approach because, according to Kane (2013, 2016), validity is a framework of justified interpretation and uses of test scores, not an intrinsic feature of the test or its performance.

Construct validity of PISA (OECD, 2019b) has been established primarily on the basis of (1) the defined purpose of the instrument, (2) evidence from field trials of items testing the PISA theoretical framework, and (3) adherence to all technical standards throughout the assessment process. In the ‘National Project Manager Manual’ (OECD, 2019b) of PISA 2021, a similar argument of technical standards as validity evidence is put forward. However, the actual process and the final documentation of validity arguments are a prototype of ‘assembled validity’ (Addey et al., 2020). That is, the cumulative practice of generating and establishing evidence of validity, from the field trials to the publication of the report, involves constant negotiation, and the different stakeholders have varying levels of socio-political power and material resources. Of special relevance is the instance of PISA for Development (PISA-D) for low- and middle-income countries, first launched in 2013 (OECD, 2016a). Addey et al. (2020) distinguished between authorized and unauthorized validity arguments, highlighting that the crucial discussions of PISA-D items have, for the most part, happened not in official meetings but in unofficial conversations. India’s revoking of its decision to participate in PISA is also a relevant instance. Only two states, Tamil Nadu and Himachal Pradesh, participated, an event which was declared by the OECD to ‘..not meet the PISA standards for student sampling’ (Bloem, 2013, p. 21). The results from these data were subsequently rejected by the Indian Government on the grounds that the PISA test was not appropriate for the diverse contexts of the country (Chakraborty et al., 2019). If negotiation fails or new data contest the existing status of the scientific instrument of reading (OECD, 2018b), how does such a singular event inform the attempt to gather new evidence of validity?

The OECD (1999) states that most PISA items were developed in English for practical reasons. Indeed, the PISA reading test items have two original source versions: English and French. All translations into other languages are based on one of these two versions. Compared to the original English version, the Finnish version was on average 8% longer, the Irish version 11% longer, and the German version 17% longer in one study (Eivers, 2010). Moreover, these translated measures are judged to be more comparable if they share linguistic (Grisay et al., 2009) and geographical (Grisay & Monseur, 2007; Kankaraš & Moors, 2014) proximity with the original source. Of the 30 countries that submitted reading items for PISA 2009, only two were from Asia: Korea and Macao, China (OECD, 2009). Takayama (2018) noted that the Reading Expert Group (REG) of that same year included two representatives from Asia, from Japan and Korea. The same nominee from Japan was also a part of the PISA 2000 Reading Function Expert Group (OECD, 2001). Though this individual was intended to be the expert from outside Europe and North America, the representative was described as having ‘no expertise in reading at all’ (Takayama, 2018, p. 226). Given such an evidently skewed representation of expertise, arguably reflected in the PISA literacy framework, a reexamination of what poor reading and proficient reading actually mean, in both a practical and a theoretical sense, is warranted. If assessment aims to inform educational policy and practice, a dyslexia lens needs to be inherent in this framework. For example, there is no definitive answer as to “What 15-year-old students in Malaysia know and can do” (OECD, 2018c) based on a reading measure unless we have clear evidence of how emergent word decoding contributed to this knowing and doing.

As mentioned above, the measurement invariance of the PISA test has been consistently questioned (see Arffman, 2013). For example, Söyler et al. (2021) found substantial differences in the item thresholds and factor loadings of PISA 2015 reading test items between countries testing native English speakers and countries testing non-native speakers. Scores from the first group (Canada, the USA, and the UK) were compared with those of the second (Japan, Thailand, and Turkey); the authors reported that eight of the twenty-eight items showed limited invariance. In light of such substantial non-invariance, what inferences about reading and educational attainment can we make from the international ranks of these countries?

The PISA ranking analysis is based on marginal maximum likelihood (MML) estimates of item and population parameters. This estimation assumes a normal distribution of reading ability as a latent variable, even though the participating countries differ markedly in languages and educational policies (Kreiner & Christensen, 2014). The scaling of PISA items follows the Rasch model, which is thus the basis of the person parameter estimation. This model assumes that a unidimensional trait gives rise to the scores and, most importantly, that each item has the same difficulty level in every country (Berliner, 2020). Kreiner and Christensen (2014) reported negative estimates of the Rasch parameter for all countries, indicating that the assumption that scores arise from a unidimensional trait, namely the difference between a student's reading ability and the difficulty of the item, requires more scrutiny. Moreover, using the conditional likelihood ratio (CLR) test, they demonstrated that PISA reading item parameters are not the same in all countries.
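A minimal simulation sketch below (written in Python with purely illustrative values; this is not PISA's operational scaling pipeline) shows why invariant item parameters matter. Under the Rasch model, the probability of a correct response is $P(\text{correct}) = 1/(1 + e^{-(\theta - b)})$; if a single item is harder in one country for construct-irrelevant reasons (for example, translation), that country's mean score drops even though the two ability distributions are identical.

```python
# Illustrative sketch: differential item functioning (DIF) under a Rasch model.
# All values are hypothetical; item 5 is assumed to be harder in country B only.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 5000, 10
b = np.linspace(-1.5, 1.5, n_items)   # intended (common) item difficulties
dif = np.zeros(n_items)
dif[4] = 0.8                          # construct-irrelevant extra difficulty for one item

def simulate(difficulties):
    # Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))
    theta = rng.normal(0.0, 1.0, n_persons)   # same ability distribution in both countries
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulties[None, :])))
    return (rng.random((n_persons, n_items)) < p).astype(int)

resp_a = simulate(b)          # country A: items behave as intended
resp_b = simulate(b + dif)    # country B: one item carries extra difficulty

def difficulty_proxy(resp):
    # crude proxy: logit of the proportion of incorrect responses per item
    p_correct = resp.mean(axis=0)
    return np.log((1.0 - p_correct) / p_correct)

gap = difficulty_proxy(resp_b) - difficulty_proxy(resp_a)
print(np.round(gap, 2))   # near 0 for invariant items, clearly positive for the shifted item
print("Mean score gap attributable to DIF alone:", round(resp_a.mean() - resp_b.mean(), 3))
```

In this sketch, the difficulty gap hovers near zero for the nine invariant items and is clearly positive for the shifted one, and the whole country-level score difference is an artefact of that single item rather than of any difference in reading ability.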

In addition, the imprimatur of objective assessments purportedly co-constructed with local experts is often used to substantiate the contextual claims of PISA (Lockheed et al., 2015). “It is an international assessment, so we cannot shape it very much” (Gorur et al., 2019, p. 319) was a comment made by an official of the Research and Test Development department of the Zambian PISA-D implementation team. With the primary focus on standardised procedures of test construction and adaptation, the need for contextualization remains an unresolved tension (Gorur et al., 2019). For researchers studying reading development across languages, this can be reframed as two salient challenges. Conceptually, how and when do word reading difficulties converge with comprehension impairments, and does this convergence replicate across languages and across socio-cultural and educational contexts? Methodologically, how can we refine psychometric measures to capture the construct comprehensively, so that variation does not undermine the larger scientific goal of generalization? We address these questions by presenting the case of reading development in three countries of Southeast Asia.

Sampling and comparisons from Southeast Asia: reading in multilingual contexts

Indeed, the primary barrier to evaluating the comparability of PISA scores is the difficulty of sampling (Bloem, 2013, 2015), specifically from regions that are allegedly performing at lower levels. Sampling comparability is critical for estimating trends in literacy development and impairment over time. To illustrate this issue more concretely, we selected three low-performing countries in Southeast Asia for comparison: Indonesia, Thailand, and Malaysia, with populations of roughly 273 million, 70 million, and 32 million, respectively.

Our rationale for selecting these three Southeast Asian countries was the following. First, they all fall under the middle-income, low-performers group (The World Bank Group, 2019). Second, the measure was administered in one of the regional languages of all three countries. This is important because taking a standardized test in one's native language is presumably helpful: in a meta-analysis, Melby-Lervåg and Lervåg (2014) found that second-language learners showed medium-to-large deficits in reading and language comprehension compared with first-language learners. By contrast, in the Philippines, for example, the test is administered in English (OECD, 2019c), a language that all children learn in school but that is likely not the native language of the vast majority of school children. Third, all three countries had participated in PISA prior to 2018, allowing us to make a trend comparison. Moreover, few studies have been published using PISA scores from Southeast Asia. In fact, most research publications on PISA concern the USA (114), Australia (72), Germany (69), the UK (52), and Ireland (31) (Hopfenbeck et al., 2018).

We examined students’ PISA reading performance in Indonesia, Thailand, and Malaysia (Table 1) over a decade: scores from 2009, 2012, and 2015 were compared with those of 2018. According to the OECD (2018c), the 2015 PISA scores of Malaysia are internationally incomparable ‘due to the potential of bias introduced by low response rates in the original PISA sample’ (p. 3). Approximately 5,000 students were sampled from each country, making the aggregate samples broadly comparable in size.

Table 1 Change in mean PISA Reading Scores over time, 2009 to 2018.

Table 1 shows that, across 10 years, all three countries either declined or stayed relatively steady in PISA scores; none showed substantial improvement. In contrast, for reference, consider a country that has consistently secured high ranks, namely South Korea. The performance of high-achieving students contributed to South Korea's significant 31-point increase in reading between PISA 2000 and PISA 2006 (Schleicher, 2009). New questions arise from this contrast. First, if it is primarily the advanced readers of a country who start scoring higher, can this alone improve the average performance and, hence, the overall rank of that country? Second, how effectively do the PISA items actually assess low-achieving students, or the poorer readers?
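A back-of-the-envelope calculation, using assumed and purely illustrative numbers, shows why gains concentrated among advanced readers can move a country's mean: if only a proportion $p$ of students improves by $\Delta$ points while everyone else is unchanged, the national mean rises by $p \times \Delta$. For instance,

$$\Delta\bar{x} = p \times \Delta = 0.20 \times 50 = 10 \text{ points},$$

so a 50-point gain confined to the top 20% of readers would lift the country mean by 10 points, which may be enough to shift its international rank even though the poorest readers have not improved at all.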

PISA reading Level 2 is the baseline level of proficiency. Though it is not a starting point, students at this level can locate multiple pieces of information in a text of moderate length and understand the relationships between them. When there is little or no extraneous information, they can identify the central message relatively easily. Those who score below Level 2 are identified as low achievers on the PISA reading scale (OECD, 2016a). Following analyses for PISA 2012, low achievers are further divided into sub-categories below Level 2 (namely, 1a, 1b, 1c, and below).

As shown in Table 2, the majority of students in all three countries are either baseline or poor achievers. That is, the percentages of participants from Indonesia, Malaysia, and Thailand who are at Level 2 or below are 91.7%, 77.3%, and 85.6%, respectively. This raises the question of how more than half of the students tested in these countries can be poor readers. For reference, the percentages of students in the US, the UK, and Australia at Level 2 or below were 40.31%, 40%, and 40.7%, respectively.

Table 2 Percentage of the 2018 PISA scores of readers on and below Level 2.

Is it possible that the experience of PISA might differ across regions? Do items differ strongly based on language and/or educational practices and, hence, influence performance? In Asia, it is likely that the PISA measures generally are less comparable across linguistic and cultural dimensions (Grisay & Monseur, 2007; Grisay et al., 2009). One case of ‘PISA shock’ (Santos & Centeno, 2021) in Asia was Japan's PISA rank decreasing from 8th in 2000 to 14th in 2003 (Takayama, 2008). Japanese students omitted 9% of the items in the PISA 2009 reading measure (Okumura, 2014), and this omission tendency was larger for open-ended items than for closed ones, supposedly because of a lack of experience in generating sentences. As a concrete example of this phenomenon elsewhere, Hatzinikita et al. (2008) reported discrepancies between the language used in Greek textbooks and that of PISA items: the former uses specialized scientific text, whereas the latter adopt a nonspecialized narrative mode focusing on scientific method. The authors hypothesized that years of experience shaped by the textbook standard influenced students' reading of the items and the inferences they made. Based on the analysis of data from English-, French-, and German-speaking countries, Blum et al. (2001) argued that the comparability claim is falsified by the significant association of item success rates with the language used.

Moreover, multilingualism is the norm for most people across the globe. Consequently, perhaps the majority of all children worldwide learn to read and write in a language different from the one spoken in their home environment (e.g., McBride, 2016). As mentioned above, dyslexia is indeed a cross-cultural phenomenon that occurs in all languages and scripts (Shaywitz et al., 2008). However, difficulties in reading are also likely to be more pronounced in the context of diglossia, defined here as the linguistic distance between the language students learn to speak and the written form they are exposed to in literacy instruction (see Saiegh-Haddad et al., 2022). This is one reason that Vagh and Nag (2019) argued that Generalizability Theory and Item Response Theory are inadequate for international usage and for Akshara languages. For example, for about 80 percent of school students in Indonesia, Bahasa Indonesia is their second or third language (Elley, 1992), yet PISA testing in Indonesia consistently takes place in Bahasa Indonesia. Previous research has demonstrated that children's literacy learning may be impeded when the home and school languages differ, particularly in low- and middle-income countries (Nag et al., 2019). It is not uncommon for children in Asia, particularly in Indonesia and Malaysia, to have no opportunity to choose a school language (i.e., medium of instruction) that best reflects their home language (see McBride et al., 2022). Furthermore, even when such children's home language is indeed similar to the school language, namely Malay, the variant of Malay to which they are exposed may be quite different. For example, Sneddon (2003) stated that most, if not all, Indonesian children are exposed only to the ‘informal’ variant of Malay (i.e., Bahasa Indonesia); only through schooling are they exposed to the ‘formal’ variant. Previous studies among young children in Singapore (learning Bahasa Melayu) have shown that, because the formal and informal Malay variants differ in grammar, vocabulary, and phonology, exposure to the nonstandard ‘informal’ variant leads to difficulties in learning standardized written Malay (Jalil & Liow, 2008; but see Habib et al., 2022 for different results); such difficulties also tend to affect the acquisition of literacy skills. The topic of diglossia in Malay warrants more research across Malay-speaking countries.

Diglossia is common in various parts of the world; the relationship between spoken Arabic varieties and Modern Standard Arabic is one of the most prevalent instances (Maamouri, 1998; Saiegh-Haddad, 2003). The PISA-D report for Senegal estimated that for about 93.7% of grade 7 students, French, their medium of instruction in school, is not their home language (OECD, 2017, p. 49). The most recent 2020 PISA-D report noted that only 28% of respondents in Senegal speak French in their home environment; likewise, only 17% of participants in Paraguay reported speaking Spanish at home (OECD, 2020, p. 11). In Zambia, an estimated 83% of students have a home language different from the language of instruction and of the PISA items, and about 80% of students score below Level 1a (The Ministry of General Education, Zambia, 2017). According to the PISA reading proficiency criteria, this means that these readers struggle to understand the literal meaning of a short and simple text, to identify explicitly stated relevant information, or to recognize the purpose of a passage (OECD, 2018d). In all these countries, linguistic distance is a major hindrance to overall school performance, associated with outcomes such as grade repetition (Delprato, 2021).

In the 2018 PISA data for Thailand, students' and schools’ economic, social, and cultural status (ESCS) together explained 37.7% of the variance in reading scores within schools (OECD, 2019d). Using this same index and PISA scores from 2003 to 2018, Lam and Zhou (2021) showed that the high-performing education systems in East Asia also have consistent and significant gaps in achievement related to socio-economic status. This disparity is smallest in Macao, where academic achievement and opportunities differ little for students across socio-economic strata.

How reliably can reading proficiency be compared across such diverse linguistic and cultural contexts? Perhaps identifying struggling readers can be an equitable and effective starting point. To reiterate, similar to the challenge of construct equivalence in PISA, reading impairments can manifest differently across languages. Lopes et al. (2020) analyzed 800 studies of dyslexia undertaken over the past two decades and found that clear criteria for participant recruitment were rarely made explicit. The norm of demonstrating an IQ–reading discrepancy still persists (Tzouriadou, 2022), despite substantial evidence that this discrepancy requirement is not helpful for understanding persistent reading difficulties (Siegel, 1989). Elliott and Grigorenko (2014) made a provocative plea to replace the term dyslexia with reading disability. Amidst the debates, differences, and controversies, the term ‘dyslexia’ is here to stay for the foreseeable future (Elliott, 2020). If it is true that scientific knowledge generation and verification struggle to keep pace with colloquial parlance and societal discussion, what then should be the role of scientists in society? Perhaps recognizing the diversity of languages and learning contexts can be the central tenet of both science and policy.

Asia has many languages and scripts. Each country faces unique curricular challenges requiring mastery of multiple languages and scripts, often dissimilar to those spoken in students’ home environments. Dyslexia is described and diagnosed differently across the globe, but the typical consensus is of impairment at the level of word reading. However, most international estimates of literacy assess reading and language comprehension, not word reading. Given the technical conditions and contextual challenges discussed hitherto, what can international large-scale literacy assessments teach us about dyslexia, in terms of both perils and prospects? A preliminary step could be to revisit the global perspective on literacy and refine our understanding of the association between early word reading and comprehension.

How is word reading related to reading comprehension across languages and scripts?

One fundamental question that remains critical is the extent to which word reading involves the same processes across languages and scripts. This issue of “universals” (Frost, 2012) and “specifics” of reading (Plaut, 2012) has centered on word reading. It is all the more challenging to extend what is known about dyslexia to reading comprehension difficulties. Moreover, how these early processes relate to PISA scores is unknown. Previous work (e.g., Daniels & Share, 2018; McBride & Mo, 2021) has underscored sources of variability in word reading and word writing across cultures. Differences across phonological, semantic, and visuo-orthographic aspects of print may or may not influence reading speed and accuracy. A consideration of these may optimize construct validity across cultures. Treating and reporting construct validity explicitly and accurately can facilitate the comparability of scores across contexts.

Some unique aspects of reading are best illustrated with Thai. In Thai, both early and skilled readers frequently use syllable segmentation strategies and tone markers as salient cues. This is a critical language-specific aspect embedded within the instructional context: the formal teaching method for Thai primarily focuses on teaching children correspondences between whole spoken and written syllables rather than grapheme–phoneme correspondences (Winskel, 2013).

At the semantic level, Thai often involves substantial compounding. Thai also involves considerable top-down processing to disambiguate ambiguous phrases and sentences; more flexibility in processing appears to be required for Thai print than for most other scripts and languages (e.g., Aroonmanakun, 2002). At the level of visuo-orthographic processing, most students’ literacy instruction in school involves reading and spelling similar words (Winskel & Ratitamkul, 2019). The use of a lexical strategy is inferred from the higher occurrence of lexical errors in reading both words and nonwords, and early Thai readers tend to segment monosyllables inaccurately (Winskel & Iemwanthong, 2010). Text is also typically presented unspaced, potentially slowing down reading in children (e.g., Kohsom & Gobet, 1997; Winskel et al., 2009). Unspaced Thai texts, such as those used for the items presented in PISA, can result in lower scores due to slower processing, longer reading times, and more errors. If most cognitive resources (Ehri, 2005) are spent on reading unspaced words and lexical quality remains low, reading comprehension can be severely impaired. If better word processing facilitates the formation of compounds, a crucial characteristic of Thai morphology, perhaps students’ reading comprehension can also be enhanced.

What matters? The extent to which PISA reading performance is affected by the variables mentioned above is not yet known. For example, it is difficult to judge with certainty whether and how diglossia in Malay might affect reading comprehension. Equally, we do not know whether the several aspects of text reading in Thai that differ substantially from English text reading influence the speed of reading comprehension. Our focus here is primarily to point out what could matter for reading performance and to stimulate further research on these phenomena, particularly vis-à-vis the issue of poor reading performance, including dyslexia.

Given these illustrations, we have three main suggestions for enhancing construct validity in future PISA and related work focused on identifying proportions of poor readers across regions. These are informed by lessons from diverse fields of inquiry, as global dyslexia discourse necessitates interdisciplinary perspectives. Petscher et al. (2020) proposed characteristics of “team translational science”, which they consider to represent the roadmap for reading research (Solari et al., 2020). Inspired by this work, we propose recommendations that correspond to the aforementioned challenges. These are to be considered coexisting, interdependent components rather than independent, disparate elements.

Suggestion 1: including a dyslexia lens in rethinking the international literacy framework for educational policy consequences

What might large-scale literacy assessments look like if they were conceptualized and designed to identify poor decoders rather than proficient comprehenders? The explicit assumption here is that reading proficiency is continuously distributed. That is, most students are neither dyslexic nor highly advanced; they are, by definition, in the middle of the distribution of all readers. Given this, a potential consideration is to have well-defined criteria for functional illiteracy (Vágvölgyi et al., 2021), categorically distinct from both developmental dyslexia and skilled reading. One might argue that the same measure that reports proficient readers in some languages and countries disproportionately detects poor readers in others.

If improving reading proficiency is the target and identifying struggling readers is the starting point, then the formative question to ask is “What is the purpose of assessment?” We emphasize that the question of “How does this facilitate the student?” should be the consistent underlying theme for literacy stakeholders across all levels of decision making and action. This is also echoed by Winograd et al. (1991), who recommend that the goals of assessment be explicitly and directly linked to interpretation for instruction. Applications for instruction can and should be a consistent anchor for all stages of dyslexia research design and implementation.

How can international large-scale assessment of reading and local small-scale identification of and intervention for reading disability inform each other? This question is particularly important in countries in which the research and development system for dyslexia is negligible (Mather et al., 2020). The “science of reading” also needs to be understood within the context of the sociology of education and cultural anthropology. A truly international approach to dyslexia and broader literacy should also account for cultural practices and collective attitudes towards learning. The ‘literacy without schooling’ evidence (Scribner & Cole, 1978) alerted the scientific community to the importance of re-examining our conventional understanding of cognitive development and literacy. There are both major commonalities in reading across scripts and major cultural variations across the world; these two need not be antithetical.

Suggestion 2: methodological rigor for educational equity

Methodological rigor entails measure construction, sample selection, and data collection and reporting. At the outset, the equivalence of the construct and the measure across languages must be established (Papadopoulos et al., 2021). Before the initial phase of item construction and selection, a team of experts including researchers and teachers from the region should illustrate the linguistic distinctiveness of reading a particular script and how this poses challenges within their educational context (for a review, see Daniels & Share, 2018). Formal documentation and presentation of the arguments and evidence should be published, going beyond the mere claim of consultation. Allowing variation within standardization that accounts for the correspondence between students’ background knowledge and their experiences with assessment (Snyder et al., 2005) might improve comparability. For instance, for Thailand, one might include a statement of how items assessing information retrieval in Thai could be weighted differently than for languages/scripts with spaced words, since time and errors can be influenced by spacing.

A potential solution to measurement non-invariance could be the alignment method (Asparouhov & Muthén, 2014), which treats invariance as an optimization problem. The primary assumption of this method is that most items are approximately or partially invariant, and the goal is to minimize non-invariance by reducing the differences between factor loadings and item intercepts across groups. Used as an exploratory procedure to identify non-invariant items, the alignment method is perhaps best suited for group comparisons of latent mean scores (Luong & Flake, 2022). Suggested cutoffs for the tolerable proportion of non-invariant items are 25% (Asparouhov & Muthén, 2014) and 29% (Luong & Flake, 2022). The method allows researchers to consider the extent to which the non-invariance is theoretically and practically significant. For example, does the non-invariance tell us how the distinctive challenges of students with difficulties in reading Thai differ from those of students reading Bahasa Indonesia? This decision about the meaningful interpretation of differences at the factor and item levels can be made in the construction and design phase, prior to analysis. Running simulations across levels of measurement invariance with specific features of different groups could help in making a more accurate decision about the continuity of reading from decoding and fluency to text comprehension.
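Schematically, and following the general form described by Asparouhov and Muthén (2014) with notation simplified here, the alignment optimization searches for group factor means and variances that minimize a total non-invariance loss of the form

$$F=\sum_{j}\sum_{g_{1}<g_{2}} w_{g_{1},g_{2}}\, f\!\left(\lambda_{jg_{1}}-\lambda_{jg_{2}}\right)+\sum_{j}\sum_{g_{1}<g_{2}} w_{g_{1},g_{2}}\, f\!\left(\nu_{jg_{1}}-\nu_{jg_{2}}\right),$$

where $\lambda_{jg}$ and $\nu_{jg}$ are the loading and intercept of item $j$ in group $g$, $w_{g_{1},g_{2}}$ weights group pairs by their sample sizes, and $f$ is a component loss function chosen so that the total loss is smallest when non-invariance is concentrated in a few large, identifiable differences while most differences remain near zero. In the PISA context, the groups would be countries or language versions, and the flagged parameters would indicate which reading items function differently across them.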

Contrary to the ‘large-scale’ norm, Wagner (2011) proposed an alternative Smaller, Quicker, Cheaper (SQC) approach to literacy assessment. Informed by this approach, we suggest small but strategic sampling, going beyond the norm of studying and testing easily accessible groups, together with more frequent intervals of assessment. The former can be a means of including students and languages that have historically been excluded from any formal system of research and intervention; the latter might enable at-risk students to be identified without further delay. Wagner also suggests shareability as a way to equitably negotiate the conflict between the etic objective (international comparison) and the emic approach (local contexts). This entails transparency in the development of methods and measures to ensure replicability by the sites concerned, without perpetual dependence on external experts.

For PISA sampling, a question that has not received enough scrutiny is “Where are the 15-year-olds?” (OECD, 2016b). Whom we study informs what is reported as the norm and the outliers. Barrett (2020) calls for the new wave of cross-cultural cognitive science to adopt ‘principled sampling of people and phenomena’ (p. 683), using both hypothesis-driven and representative sampling. Most international assessments explicitly aim for the latter. However, comparability would be of little value without a priori hypotheses about variation. If we hypothesize, for example, that English and Thai are similar in that phonemic awareness influences early word reading, but differ in the influence of morphological knowledge on comprehension, then the test items should reflect this. An often-neglected fact is that there is a large middle ground between proficient readers and dyslexic readers. For measures to capture the expected similarity and/or variability, we suggest principled sampling. A common data collection and management framework mandating transparency in the psychometric properties of measures (Flake, 2021) and in the statistical analyses of data is needed to monitor progress and evaluate the effectiveness of an international assessment project.
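As one illustration of hypothesis-driven sampling, the sketch below (in Python; the correlations are assumed, purely illustrative values rather than estimates from the literature) asks how many students per language group would be needed to reliably detect a hypothesized cross-language difference in the strength of the phonemic awareness–word reading association.

```python
# Illustrative power sketch for principled, hypothesis-driven sampling.
# The assumed correlations (r_lang1, r_lang2) are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(1)

def power_to_detect(n_per_group, r_lang1=0.55, r_lang2=0.40, n_sims=2000):
    hits = 0
    for _ in range(n_sims):
        z = []
        for r in (r_lang1, r_lang2):
            # simulate (phonemic awareness, word reading) pairs with correlation r
            x = rng.normal(size=n_per_group)
            y = r * x + np.sqrt(1 - r ** 2) * rng.normal(size=n_per_group)
            z.append(np.arctanh(np.corrcoef(x, y)[0, 1]))   # Fisher z-transform
        se_diff = np.sqrt(2.0 / (n_per_group - 3))           # SE of z1 - z2
        if abs(z[0] - z[1]) / se_diff > 1.96:                 # two-sided test, alpha = .05
            hits += 1
    return hits / n_sims

for n in (100, 200, 400):
    print(f"n per group = {n}: power ~ {power_to_detect(n):.2f}")
```

With these assumed values, several hundred students per language group would be needed for adequate power, an order of magnitude that is compatible with the smaller, strategic samples suggested above.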

Of particular relevance now is the feasibility of online data collection, a challenge accentuated by the Covid-19 pandemic. Online data collection can be adapted to reach students from marginalized communities who have hitherto seldom been included. At the same time, it can become yet another double-edged sword, because internet and technology access is still a function of socio-economic resources in most countries. A way forward can be training and incentivizing researchers and other stakeholders from these settings to collect and share their data and findings in a crowd-sourced database such as the recently launched Global Literacy Assessment Dashboard (GLAD) (Patel, 2021). Since the GLAD is in its initial stage, it can be a useful platform for adapting lessons from similar global initiatives and for bridging the gap between literacy and dyslexia research.

Suggestion 3: communication and collaboration for accurate interpretation and effective implementation

In interpreting reading scores from a single country, as well as in comparing similarities or differences between countries, we make generalizations from the sample to the population. The generalizability challenge is the horse that must come before the cart of valid cross-country comparisons. Generalizability extends beyond population differences or the statistical analyses of relationships; an essential strategy is to reconsider the causal relationships between variables and the exact mechanisms by which countries and/or students differ. Taking a lesson from the ‘constraints on generality’ statement proposed for empirical research (Simons et al., 2017) and demonstrated in a cross-cultural study by Tiokhin et al. (2019), international literacy reports could include similar statements. For example, a reading fluency test was introduced, though not included in the cumulative PISA reading scores, purportedly to account for readers at lower proficiency levels. For this test, it would be advisable to highlight that, for certain languages and countries, scores may vary significantly if interpreted on the basis of word-level items alone. An attempt in this direction to improve validity and generalizability amidst the variation across contexts can also be valuable for dyslexia discourse across cultures.

‘Different countries, different evidence?’ asked Strassheim and Kettunen (2014) in light of international comparisons for science-informed policy. It is incumbent on us not to let this remain an ostensibly rhetorical remark when discussing literacy achievements and impairments globally. For evidence-based dyslexia research to mature into effective, implementable programs, we recommend instruction-focused, culturally grounded, and institutionalized practices, from conceptualization to communication.

Conclusion

The multiplicity of problems and contexts need not be a deterrent to refining the construct of reading. We have demonstrated this using the PISA validity framework, including its procedures and evidence, with relevant examples from multilingual contexts of literacy development. A global scientific and educational movement for literacy is a worthwhile collective goal. The primary step could be the re-examination of the theoretical and operational definitions of reading abilities and disabilities to inform international assessments. Further, research and implementation systems should be explicitly designed and mandated to promote transparent collaboration, equity, and effectiveness. With common goals and strategies for enhanced literacy and learning, global networks with local partnerships can be tenable and mutually beneficial. The suggestions made here, if optimally adapted, can enhance the accuracy and validity of assessments of global literacy achievement. Thus, concerted institutional initiatives can enable the capture of a clearer and more granular picture of how students around the world read.