Early identification of autism spectrum disorder (ASD) is crucial for accessing interventions that promote optimal behavioral, social, and functional outcomes. Signs of ASD manifest within the first years of life (Ozonoff et al., 2010; Zwaigenbaum et al., 2013) and a diagnosis can be reliably made by 18–24 months of age (Guthrie et al., 2013; Ozonoff et al., 2015; Pierce et al., 2019), yet the average age of ASD diagnosis in the United States remains around four years old (Maenner et al., 2023).

For over a decade, the American Academy of Pediatrics (AAP) has recommended routine developmental surveillance and ASD-specific screening for all children at 18- and 24-month well-child exams as a means to lower the diagnostic age (Hyman et al., 2020). However, the heterogeneous timing of ASD symptom emergence makes early identification a challenge (Ozonoff et al., 2018; Zwaigenbaum & Maguire, 2019), as do the variable performance of screening tools, the potential harm of classification errors, and the unknown cost-benefit ratio of universal screening (McCarty & Frye, 2020; Yuen et al., 2018).

One of the most extensively used screening tools is the Modified Checklist for Autism in Toddlers (M-CHAT; Robins et al., 2001) and its latest revision, the M-CHAT-Revised with Follow-Up (M-CHAT-R/F; Robins et al., 2014); in this paper, we collectively refer to both versions of the instrument as the M-CHAT-R. A recent systematic review and meta-analysis reported pooled sensitivity of 83% (95% CI: 77–88%) and pooled specificity of 94% (95% CI: 89–97%) across 51 studies (Wieckowski et al., 2023).

Several factors influence the psychometric performance of the M-CHAT-R, including the type of sample (elevated likelihood versus unselected community), reporter familiarity with ASD, age at screening, number of screenings, use of follow-up interviews, time between screening and diagnosis (e.g., concurrent versus prospective identification), and ongoing surveillance of children who screen negative. Psychometric properties of sensitivity (SE), specificity (SP), and positive and negative predictive value (PPV/NPV) tend to be lower in community than clinical samples (Carbone et al., 2020; Guthrie et al., 2019; Toh et al., 2018) and in younger than older toddlers (Øien et al., 2018; Sturner et al., 2017a; Sturner, Howard, Bergmann, Stewart et al., 2017). In the few studies that have conducted repeated screenings to identify cases missed by initial screening, as recommended by the AAP, there is a large increase in SE without a compromise in SP (Wieckowski et al., 2023).

Accurate calculations of SE and SP require systematic procedures for identifying missed cases within the screen-negative sample, but in large community studies it is not practical or feasible to evaluate all screen-negative cases to confirm diagnostic status (Levy et al., 2020; McPheeters et al., 2016; Sheldrick et al., 2023). Samples that are at higher likelihood of an ASD outcome, such as younger siblings of children with ASD, can be helpful in this regard. The design of most infant sibling studies includes diagnostic evaluation of the entire sample, regardless of screening status, permitting calculation of SE and SP, in addition to PPV. Familial samples are also of interest because the psychometric properties of screeners may be different given enhanced knowledge of ASD symptoms in parents who already have an affected child. Only two studies have used the M-CHAT-R in toddlers at elevated likelihood for ASD due to family history. Bradbury et al. (2020) compared M-CHAT-R screening results of toddlers at higher likelihood for ASD (n = 187) to lower likelihood toddlers from the M-CHAT-R validation study (n = 15,848; Robins et al., 2014). The M-CHAT-R’s PPV in the elevated likelihood group (53%) was much higher than the PPV in the M-CHAT-R validation sample (14%). SE and SP could not be calculated because screen-negative children in the lower likelihood sample were not systematically followed. Weitlauf et al. (2015) also used the M-CHAT-R in a sample at higher likelihood for ASD (n = 74) but did not provide lower likelihood comparison data. Most of the sample was followed, regardless of screening status, to later evaluation. The study reported that 18-month M-CHAT-R scores had high SE, SP, PPV, and NPV (generally above 70%) for concurrent 18-month diagnosis as well as predictive identification of ASD at 24 or 36 months. Neither study of higher likelihood infant siblings (Bradbury et al., 2020; Weitlauf et al., 2015) screened participants more than once and compared single to repeat screenings.

The goal of the current study is threefold: (1) compare psychometric properties of the M-CHAT-R in two non-overlapping samples of toddlers at higher likelihood for ASD, a single screened group (at 18 months) and a repeat screened group (at both 18 and 24 months); (2) prospectively follow all children, regardless of screening status, to outcome age at 36 months, to allow for calculation of SE, SP, PPV, and NPV; and (3) examine consistency in reporting of ASD symptoms across the M-CHAT-R and a developmental concerns interview in parents with higher versus lower familiarity with ASD based on family history.

Methods

Participants

The M-CHAT-R was completed by parents of 135 toddlers with an older sibling with ASD (Higher Likelihood or HL group) at 18 months. A subset (n = 75) was screened a second time at 24 months (see Table 1); this subsample was not selected based on clinical concerns or screening history. The smaller size of the repeat-screened group was due to later addition of the second screening to the study protocol, after some participants had already aged out of the 24-month screening window. HL infants had at least one older sibling with ASD (proband), whose diagnosis was confirmed using the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; Lord et al., 2012) and the Social Communication Questionnaire (SCQ; Rutter et al., 2003).

Table 1 HL Sample (n = 135) Demographic Characteristics

Parents of toddlers with lower likelihood (LL) of ASD (n = 88) also completed the M-CHAT-R. The LL sample was used as a comparison group to examine consistency in reporting of ASD symptoms across the M-CHAT-R and a developmental concerns interview. There were too few ASD diagnoses within the LL group (n = 3) to compare the M-CHAT-R’s screening properties (e.g., SE, SP, etc.) across the HL and LL groups. LL infants had typically developing older siblings confirmed by an intake interview and proband SCQ scores below the ASD cutoff. Exclusion criteria for both groups were birth before 32 weeks of gestation and a known genetic disorder in the proband. Parents provided informed consent and the study was approved by the university’s Institutional Review Board.

Measures

Visits were conducted at a major medical center in the United States at 18, 24, and 36 months by examiners unaware of participants’ likelihood group (HL vs. LL) and of results from previous visits.

Modified Checklist for Autism in Toddlers, Revised (M-CHAT-R; Robins et al., 2014) is a parent-report checklist with 20 yes/no response items. A positive screen is defined as 3 or more failed items. The M-CHAT-R questionnaire was mailed to parents and completed prior to their in-person evaluation visits; for one participant, the M-CHAT-R was filled out at 24 months but the family did not attend the visit, resulting in missing 24-month diagnostic data for that child. The follow-up interview was not conducted as part of this study.

Parent Concerns Interview (Ozonoff et al., 2009). At 18 and 24 months, parents were asked if they had concerns about their child’s behavior or development. Responses were classified by coders trained to 80% reliability into 1 of 10 categories of concern (none, social, language, repetitive behavior, motor, medical, temperament/behavior, general developmental, other, and unspecified ASD concerns). The primary variable was the number of ASD-related concerns (a sum of social, language, repetitive behavior, and unspecified ASD concerns).

Mullen Scales of Early Learning (MSEL; Mullen, 1995) is a standardized developmental test for children birth to 68 months that measures motor, cognitive, and language skills. It has good internal, test-retest, and interrater reliability and convergent validity.

Autism Diagnostic Observation Schedule-2 (ADOS-2; Lord et al., 2012) is a semi-structured play-based interaction and observation that provides a comparison score (range 1–10) with a cutoff of 4 that distinguishes ASD from non-ASD cases (Gotham et al., 2009). The ADOS-2 was administered at 18, 24, and 36 months by examiners who had completed rigorous research training and achieved 80% or higher reliability with a trainer throughout the study.

Diagnostic Classification. After each visit (18, 24, and 36 months), a binary diagnostic classification (ASD and Non-ASD) was made. The ASD classification was defined as obtaining an ADOS-2 comparison score of 4 or higher and meeting Diagnostic and Statistical Manual of Mental Disorders 5th ed. (DSM-5; American Psychiatric Association, 2013) criteria for ASD, verified by a licensed clinician. All participants not meeting these criteria were classified as Non-ASD. This classification could change from visit to visit, as signs of ASD emerged over time and (rarely) early diagnoses were not confirmed at later ages. In the HL sample, n = 11 were classified as ASD at 18 months, n = 14 at 24 months, and n = 22 at 36 months.

Statistical Analysis. SE (detecting ASD when it is truly present) was calculated by dividing the number of True Positives (i.e., participants with an M-CHAT-R score of 3 or above who received an ASD diagnosis) by the total number of children with an ASD outcome (True Positives/(True Positives + False Negatives) × 100). SP (detecting non-ASD cases accurately) was calculated by dividing True Negatives (i.e., participants with an M-CHAT-R score of 2 or below who did not receive an ASD diagnosis) by the total number of children with a Non-ASD outcome classification (True Negatives/(True Negatives + False Positives) × 100). PPV (proportion of positive screens diagnosed with ASD) was calculated by dividing True Positives by all screen positives (True Positives/(True Positives + False Positives) × 100) and NPV (proportion of negative screens with Non-ASD outcomes) was calculated by dividing True Negatives by all screen negatives (True Negatives/(True Negatives + False Negatives) × 100). Differences in psychometric values in the single versus repeat screening groups were tested using Fisher’s exact test of independence (McDonald, 2009). The AAP considers SE and SP above 70% to be acceptable for ASD-specific screening measures (Council on Children With Disabilities, Section on Developmental Behavioral Pediatrics, Bright Futures Steering Committee, 2006).
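
The four definitions above can be expressed as a short Python sketch. The function name `screening_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the study's tables, and the study's own analyses were run in SPSS.

```python
from math import nan


def screening_metrics(tp, fp, fn, tn):
    """Return SE, SP, PPV, and NPV (as percentages) from 2x2 screening counts.

    tp: screen positive, ASD outcome      fp: screen positive, Non-ASD outcome
    fn: screen negative, ASD outcome      tn: screen negative, Non-ASD outcome
    """
    def pct(numerator, denominator):
        # An empty denominator (e.g., no ASD outcomes) leaves the metric undefined.
        return 100.0 * numerator / denominator if denominator else nan

    return {
        "SE": pct(tp, tp + fn),   # True Positives / all ASD outcomes
        "SP": pct(tn, tn + fp),   # True Negatives / all Non-ASD outcomes
        "PPV": pct(tp, tp + fp),  # True Positives / all screen positives
        "NPV": pct(tn, tn + fn),  # True Negatives / all screen negatives
    }


# Hypothetical counts for illustration only (not data from this study).
example = screening_metrics(tp=8, fp=4, fn=2, tn=36)
print({name: round(value, 1) for name, value in example.items()})
```

SE and SP both depend on the screen-negative cells (`fn` and `tn`), which is why they can only be estimated when screen-negative children are followed to diagnostic outcome.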

SE, SP, PPV, and NPV were calculated for two non-overlapping HL samples: a single screen group (n = 60, 18 months) and a repeat screen group (n = 75, 18 and 24 months). We examined the M-CHAT-R’s ability to concurrently distinguish ASD from Non-ASD at 18 and 24 months, as well as to predictively identify it at 36 months, comparing single versus repeat screenings.
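
As an illustration of how two groups can be compared with Fisher’s exact test, the following Python sketch computes a two-sided p value for a 2 × 2 table using only the standard library. It is a minimal sketch under stated assumptions: the function name `fisher_exact_two_sided`, the two-sided alternative, the table layout (rows = screening group, columns = ASD cases detected vs. missed), and the counts are all hypothetical and are not drawn from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical layout: rows = single vs. repeat screened group,
# columns = ASD cases detected vs. missed. Counts are illustrative only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

An exact test is a natural choice here because the number of ASD outcomes in each group is small, so large-sample approximations such as the chi-square test may be unreliable.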

To examine whether reporting of early symptoms differs across parents with and without familiarity with ASD, analyses were conducted using both HL and LL participants. Separate bivariate correlations were conducted to examine the concordance between parent-reported ASD concerns and M-CHAT-R scores at 18 and 24 months in the HL and LL groups. We used Fisher’s Z transformation to statistically compare the magnitude of the relationships within the two likelihood groups. Finally, a repeated measures analysis of variance (ANOVA) was performed to analyze the effect of age (18 and 24 months), likelihood group (HL vs. LL), and ASD-related parent concerns on total M-CHAT-R scores. Simple main effects and interactions were analyzed. All analyses were implemented in IBM SPSS Statistics Version 28.0 (IBM SPSS Statistics for Windows, 2021).
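
The comparison of two independent correlations via Fisher’s Z transformation can be sketched as follows. The function name `compare_independent_correlations` is hypothetical. The correlations in the usage example are the 18-month values reported in the Results, but the sample sizes passed in are the full group sizes (135 and 88), which is an assumption: the number of participants contributing to each correlation is not reported, so the output should not be expected to match the published statistic exactly.

```python
from math import atanh, erfc, sqrt


def compare_independent_correlations(r1, n1, r2, n2):
    """Compare two independent Pearson correlations via Fisher's Z.

    Returns (z, one_tailed_p, two_tailed_p), where the one-tailed p value
    tests whether the first correlation is larger than the second.
    """
    z1, z2 = atanh(r1), atanh(r2)            # Fisher r-to-z transformation
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    one_tailed_p = 0.5 * erfc(z / sqrt(2))   # upper tail of the standard normal
    two_tailed_p = erfc(abs(z) / sqrt(2))
    return z, one_tailed_p, two_tailed_p


# Reported 18-month correlations; group sizes used as ASSUMED analytic n.
z, p_one, p_two = compare_independent_correlations(0.45, 135, 0.26, 88)
print(f"Z = {z:.2f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

The sketch returns both one- and two-tailed p values because the text does not state which was used.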

Results

Preliminary analyses were conducted to examine whether there were any systematic differences between the HL participants who were screened only once and those screened twice. Demographic characteristics were similar across the single and repeat screened groups (Table 1), suggesting that the sample screened only at 18 months did not differ systematically from the sample screened twice.

Table 2 displays the psychometric properties of the M-CHAT-R at 18 and 24 months using concurrent diagnostic classifications. SE was higher at 18 than 24 months. SP and PPV were higher at 24 than 18 months, replicating previous research (Sturner, Howard, Bergmann, Stewart, et al., 2017). We then compared the M-CHAT-R’s ability to prospectively identify ASD at 36 months, comparing the single to the repeat screened group; see Table 3. SE and PPV were higher in the repeat screened group (SE: 89% v. 75% with single screening; PPV: 76% v. 50% with single screening), with only a small decrease in SP with re-screening (SP: 91% v. 95% with single screening), also replicating previous research.

Table 2 Psychometric properties of the M-CHAT-R at 18 and 24 months for concurrently identifying ASD and Non-ASD in the HL group (n = 135)
Table 3 Psychometric properties of the M-CHAT-R for single vs. repeat screenings for predictively distinguishing ASD from Non-ASD at 36 months for the HL group only (n = 135)

Finally, we examined the relationship between the total number of parent-reported ASD concerns and total M-CHAT-R scores at 18 and 24 months within each likelihood group (HL vs. LL). At both ages, the correlation coefficients for the HL group were higher than those for the LL group (18 months, HL: r = .45, p < .01; LL: r = .26, p < .05; 24 months, HL: r = .63, p < .01; LL: r = -.05, p > .05). The difference between the HL and LL correlation coefficients approached statistical significance at 18 months (Fisher’s Z = 1.55; p = .06) and was statistically significant at 24 months (Fisher’s Z = 3.75; p < .05).

A repeated measures ANOVA was conducted to analyze the effects of age, likelihood group (HL vs. LL), and dichotomously defined ASD-related parent concerns (whether parents ever had ASD concerns vs. no ASD concerns) on total M-CHAT-R scores (see Table 4; Fig. 1). There was no statistically significant effect of age on total M-CHAT-R scores (p > .05). There were significant main effects on M-CHAT-R scores of both likelihood group (p < .05) and parent-reported ASD concerns (p < .01), and a significant interaction between likelihood group and ASD concerns (F(1, 124) = 6.63, p < .01, \(\eta_p^2 = 0.05\)), indicating a small effect size (Cohen, 1988). Examination of simple effects demonstrated that the effect of ASD concerns depended on likelihood group. For the LL group, there was little difference in total M-CHAT-R scores between parents with and without ASD concerns. However, for the HL group, parents with ASD-related concerns had higher M-CHAT-R scores than those without ASD concerns (see Fig. 1).
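
The reported effect size for the interaction can be checked against its F statistic using the standard identity relating partial eta squared to F and its degrees of freedom. The sketch below is illustrative only; the function name `partial_eta_squared` is hypothetical, and the inputs are the F value and degrees of freedom reported above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Interaction term as reported: F(1, 124) = 6.63.
eta_p2 = partial_eta_squared(6.63, 1, 124)
print(round(eta_p2, 2))  # prints 0.05
```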

Table 4 M-CHAT-R scores as a function of likelihood group, age, and parent concerns
Fig. 1 Total M-CHAT-R scores plotted by likelihood group and the presence or absence of at least one ASD-related parent concern at 18 and 24 months

Discussion

Many previously published studies of the M-CHAT-R did not have a comprehensive strategy for identifying missed cases among those who screened negative, following only those who screened positive and focusing on identification of false positives. This strategy, while more efficient and feasible for large-scale community screening, only permits calculation of PPV and not SE or SP (Levy et al., 2020; McPheeters et al., 2016; Sheldrick et al., 2023). The current study prospectively followed a large sample of HL toddlers, regardless of screening status, from initial screening at 18 months to final diagnostic classification at 36 months. This is one of only a few studies to follow both screen positives and all screen negatives to diagnostic outcome age, allowing more accurate estimation of all psychometric properties (SE, SP, PPV, NPV). We found high SE and SP (ranging from 75 to 95%) in this study that were in line with the pooled SE of 83% and SP of 94% reported in the meta-analysis of Wieckowski et al. (2023), which included studies with weaker methods for case confirmation among screen-negatives.

This is also one of only a handful of investigations (3 of 51 studies in a recent meta-analysis; Wieckowski et al., 2023) to compare the utility of single versus repeat screenings, finding a large increase in SE (89% v. 75%) with only a small decrease in SP (91% v. 95%) with repeated screening. This adds to the literature suggesting that repeated screenings, in line with AAP guidelines, may help reduce the age of ASD diagnosis (Guthrie et al., 2019; Hyman et al., 2020; Zwaigenbaum & Maguire, 2019). Repeat screening only modestly increased false positives, 60% of whom had other developmental concerns including high activity level and speech delays, and likely benefitted from additional evaluations. Non-autism developmental delays are common in siblings of children with ASD (Bradbury et al., 2020; Sacrey et al., 2015) and thus may have inflated the false positive rate in this study.

Finally, we explored differences in M-CHAT-R reporting across parents with more (HL group) versus less (LL group) knowledge of ASD, due to family history. We found higher concordance between the reporting of ASD concerns and M-CHAT-R screening in the HL group than the LL group, such that parents with an older child with ASD were more consistent in their reporting across the two measures. This suggests that M-CHAT-R scores may be more reflective of concerns in parents who have previous experience with ASD than in those with less familiarity with ASD. As was the case in previous HL samples (Bradbury et al., 2020; Weitlauf et al., 2015), the PPV estimates from the current HL sample (ranging from 43 to 76%) were much higher than the PPV of 14% reported in the M-CHAT-R validation sample (Robins et al., 2014). Collectively, these findings suggest that the M-CHAT-R may be particularly helpful for identifying possible ASD in toddlers with an older affected sibling.

There are several limitations to the current study. First, no M-CHAT-R follow-up interviews were conducted, which are known to reduce false positives and thereby improve PPV (Robins et al., 2014). Second, despite having a relatively large sample of HL toddlers, the overall sample size is much smaller than those of community screening studies. Finally, we were unable to directly compare psychometric properties of the M-CHAT-R in HL versus LL toddlers because too few participants in the LL group developed ASD.

The contributions of the present study include demonstration of the benefits of repeat screening and estimation of SE and SP in a large HL sample with diagnostic testing conducted regardless of screening status. This study design can identify cases missed by initial screening and thus provide more accurate calculations of SE and SP. Our results, using this comprehensive strategy to identify false negatives, are comparable to the pooled estimates of SE and SP for the M-CHAT-R recently reported in a meta-analysis (Wieckowski et al., 2023).

In conclusion, these data demonstrate the strong performance of the M-CHAT-R when completed by parents who already have a child with ASD, suggesting that providers should pay close attention to scores over the cutoff for toddlers in such families and make prompt referrals for diagnostic evaluation. Comprehensive developmental monitoring and repeated ASD screenings are recommended by the AAP for all children. While this study highlights their utility specifically for children at higher likelihood of ASD due to family history, regular and repeat screenings with the M-CHAT-R are also critical for children in the general population (Hyman et al., 2020), given the instrument’s ability to identify both autism and broader developmental concerns.