Review

Big data approaches to decomposing heterogeneity across the autism spectrum

Michael V Lombardo et al. Mol Psychiatry. 2019 Oct;24(10):1435-1450. doi: 10.1038/s41380-018-0321-0. Epub 2019 Jan 7.

Abstract

Autism is a diagnostic label based on behavior. While the diagnostic criteria attempt to maximize clinical consensus, they also mask a wide degree of heterogeneity between and within individuals at multiple levels of analysis. Understanding this multi-level heterogeneity is of high clinical and translational importance. Here we present organizing principles to frame research examining multi-level heterogeneity in autism. Theoretical concepts such as 'spectrum' or 'autisms' reflect non-mutually exclusive explanations regarding continuous/dimensional or categorical/qualitative variation between and within individuals. However, the common practices of small-sample-size studies and case-control models are suboptimal for tackling heterogeneity. Big data are an important ingredient for furthering our understanding of heterogeneity in autism. In addition to being 'feature-rich', big data should be both 'broad' (i.e., large sample size) and 'deep' (i.e., multiple levels of data collected on the same individuals). These characteristics increase the likelihood that study results are generalizable and facilitate evaluation of the utility of different models of heterogeneity. A model's utility can be measured by its ability to explain clinically or mechanistically important phenomena, and also by how it explains the way variability manifests across different levels of analysis. The directionality for explaining variability across levels can be bottom-up or top-down, and should incorporate development as a means of characterizing changes within individuals. While progress can be made with 'supervised' models built upon a priori or theoretically predicted distinctions or dimensions of importance, it will become increasingly important to complement such work with unsupervised, data-driven discoveries that leverage unknown and multivariate distinctions within big data. A better understanding of how to model heterogeneity between autistic people will facilitate progress towards precision medicine for symptoms that cause suffering, and towards person-centered support.

Conflict of interest statement

The authors declare that they have no conflict of interest.

Figures

Fig. 1
Approaches to decomposing heterogeneity in autism. a A population of interest is shown, with autism cases colored green, pink, and blue; the different colors represent different autism subtypes. b The impact of ignoring heterogeneity on effect size. With a typical case–control model, we ignore these possible subtype distinctions and compare autism to controls on some dependent variable. In this example scenario there is no clear case–control difference, but the autism group shows higher variability (indicated by the larger error bars). An approach to decomposing heterogeneity is to construct a stratified model, whereby we model the subtype labels instead of one autism label and then re-examine differences on the hypothetical dependent variable of interest. In this example, the autism subtypes show contradictory effects. These effects are masked in the case–control model because averaging cancels out the distinct effects across the subgroups. c Heterogeneity in autism as a multi-level phenomenon. This panel also visualizes the difference between broad versus deep big data characteristics and labels the top-down versus bottom-up approaches to understanding heterogeneity in this multi-level context. Finally, this panel shows how development is another important dimension of heterogeneity to consider at each level of analysis (i.e., 'chronogeneity'). In this example, chronogeneity is represented by different trajectories for different types of autistic individuals.
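The masking effect described in panel b is easy to reproduce numerically. Below is a minimal sketch (not taken from the paper; the subtype means, standard deviations, and sample sizes are arbitrary assumptions for illustration) in which three simulated subtypes with opposing effects cancel out in a pooled case–control comparison, while the stratified comparisons reveal clear effects:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300  # hypothetical sample size per subtype

# Three hypothetical autism subtypes with opposing effects on a
# dependent variable (DV), plus a control group centered at zero.
sub1 = rng.normal(-1.0, 1.0, n)   # decreased response on the DV
sub2 = rng.normal( 0.0, 1.0, n)   # no difference from controls
sub3 = rng.normal( 1.0, 1.0, n)   # increased response on the DV
controls = rng.normal(0.0, 1.0, 3 * n)

autism = np.concatenate([sub1, sub2, sub3])  # pooled "one label" group

def cohens_d(a, b):
    """Standardized mean difference with pooled standard deviation."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

# Pooled case-control d is near zero: the opposing effects cancel,
# but the pooled autism group is more variable than controls.
print(cohens_d(autism, controls))
print([cohens_d(s, controls) for s in (sub1, sub2, sub3)])
```

Note that the pooled autism group also shows a larger standard deviation than controls, matching the "larger error bars" described in panel b.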
Fig. 2
Case–control vs stratified model example with adult autism and mentalizing ability. This figure reports data from Lombardo et al. [25] on two independent datasets of adults with autism and performance on an advanced mentalizing test, the Reading the Mind in the Eyes Test (RMET). a (Discovery), b (Replication) Case–control differentiation and the standardized effect size for each dataset. c–f RMET scores and standardized effect sizes from the same two datasets after unsupervised data-driven stratification into five distinct autism subgroups and four distinct TD subgroups. Autism subgroups 1–2 are highly impaired on the RMET, while autism subgroups 3–5 overlap completely with the TD population in RMET scores.
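The subgroups in this figure come from an unsupervised, data-driven stratification. As a generic, hypothetical illustration of the idea (this is not the pipeline used in Lombardo et al. [25]; the score distributions and cluster count are invented for the example), a minimal one-dimensional k-means over simulated test scores can be sketched as:

```python
import numpy as np

def kmeans_1d(x, k, n_iter=50):
    """Minimal k-means on a 1-D score vector (e.g., hypothetical test totals)."""
    # Deterministic init: spread initial centers across the score range.
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        # Assign each score to its nearest center, then update centers.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels, centers

# Hypothetical scores drawn from an impaired and an unimpaired subgroup.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(14, 2, 100), rng.normal(26, 2, 100)])
labels, centers = kmeans_1d(scores, 2)
```

In practice the choice of the number of clusters, the clustering algorithm, and its stability across resampling all require careful validation, especially in small samples.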
Fig. 3
Simulation of sample effect size estimates at different sample sizes and across a range of true population effects for a hypothetical case–control study. In this simulation we set the population effect size to a range of values, from very small (e.g., d = 0.1) to very large (e.g., d > 1.0); panels a–e show simulation results for effect sizes from d = 0.1 to d = 0.9 in steps of 0.2. We then simulated data from two populations (cases and controls), each with n = 10,000,000, that differed at these population effect sizes. Next, we simulated 10,000 experiments in which we randomly sampled from these populations at different sample sizes (n = 20, n = 50, n = 100, n = 200, n = 1000, n = 2000) and computed the sample effect size estimate (standardized effect size, Cohen's d) for the case–control difference. The gray histograms show how variable the sample effect size estimates are (black lines show 95% confidence intervals) relative to the true population effect size (green line). Visually, it is apparent that small sample sizes (e.g., n = 20) produce wildly varying sample effect size estimates, and that this variability is consistent irrespective of the true population effect size. Overlaid on each gray histogram are red histograms showing the distribution of sample effect size estimates for which the hypothesis test (e.g., independent samples t-test) passes statistical significance at p < 0.05. The rightward shift of this red distribution relative to the true population effect size (green line) illustrates the phenomenon of effect size inflation. The problem is much more pronounced at small sample sizes and when true population effects are smaller. We then computed the average effect size inflation for this red distribution and plotted it as a percentage increase relative to the true population effect in f; each line in panel f refers to simulations with a different sample size. This plot directly quantifies the degree of effect size inflation across a range of true population effects and sample sizes. The code for implementing and reproducing these simulations is available at https://github.com/mvlombardo/effectsizesim.
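The effect size inflation phenomenon can be reproduced with a few lines of code. The sketch below is a simplified re-implementation of the idea (the authors' own code, linked above, may differ in details); it samples directly from the generating distributions rather than from finite populations, fixes n = 20 and d = 0.3, and hardcodes the two-tailed critical t for df = 38 instead of computing exact p-values:

```python
import numpy as np

rng = np.random.default_rng(0)
true_d, n, n_experiments = 0.3, 20, 5000  # hypothetical settings
t_crit = 2.0244  # two-tailed critical t at alpha = 0.05, df = 2n - 2 = 38

sig_ds = []
for _ in range(n_experiments):
    cases = rng.normal(true_d, 1.0, n)
    controls = rng.normal(0.0, 1.0, n)
    # Cohen's d with pooled standard deviation
    pooled = np.sqrt((cases.var(ddof=1) + controls.var(ddof=1)) / 2)
    d = (cases.mean() - controls.mean()) / pooled
    t = d * np.sqrt(n / 2)  # equal-n two-sample t expressed via Cohen's d
    if t > t_crit:  # "significant" in the predicted direction
        sig_ds.append(d)

# Average effect size among significant results, and its inflation
# relative to the true population effect (the red vs green shift).
mean_sig_d = float(np.mean(sig_ds))
inflation_pct = 100 * (mean_sig_d - true_d) / true_d
print(round(mean_sig_d, 2), round(inflation_pct, 1))
```

With n = 20 per group, only estimates with d above roughly 0.64 can reach significance, so the significant subset necessarily overestimates a true d of 0.3 by a wide margin.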
Fig. 4
Simulation showing sampling variability and bias from enrichment of specific strata in small-sample-size studies. In this simulation we generated a control population (n = 1,000,000) with a mean of 0 and a standard deviation of 1 on a hypothetical dependent variable (DV). We then generated an autism population (n = 1,000,000) with 5 different autism subtypes, each with a prevalence of 20% (i.e., n = 200,000 per subtype). These subtypes vary from the control population in effect size in units of 0.5 standard deviations, ranging from −1 to 1. This was done to simulate heterogeneity in the autism population that reflects very different types of effects. For example, autism subtype 5 shows a pronounced increased response on the DV, whereas autism subtype 1 shows a pronounced decreased response. Across 10,000 simulated experiments, we then randomly sampled from the autism population at sample sizes of n = 20, n = 200, and n = 2000, and computed the sample prevalence of each autism subtype. The ideal, unbiased result would be a sample prevalence of around 20% for each subtype. This 20% sample prevalence is approached at n = 2000, and to some extent at n = 200. However, small sample sizes such as n = 20 show large variability in the sample prevalence rates of the subtypes, which can markedly bias the results of a case–control comparison. The code for implementing and reproducing these simulations is available at https://github.com/mvlombardo/effectsizesim.
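The subtype-enrichment part of this simulation can also be sketched compactly. The snippet below is a simplified re-implementation (again, the authors' linked code may differ): it draws subtype labels directly with equal 20% probability rather than sampling from a finite labeled population, and compares how variable the per-subtype sample prevalence is at small versus large sample sizes:

```python
import numpy as np

def sample_prevalence(n_sample, n_experiments=10000, n_subtypes=5, seed=0):
    """Repeatedly sample from a population with equally prevalent subtypes
    and return per-experiment, per-subtype sample prevalence rates."""
    rng = np.random.default_rng(seed)
    prev = np.empty((n_experiments, n_subtypes))
    for i in range(n_experiments):
        labels = rng.integers(0, n_subtypes, size=n_sample)
        prev[i] = np.bincount(labels, minlength=n_subtypes) / n_sample
    return prev

# At n = 20 the sample prevalence of each subtype swings widely around
# the true 20%; at n = 2000 it is tightly concentrated.
for n in (20, 200, 2000):
    p = sample_prevalence(n)
    print(n, round(p.mean(), 3), round(p.std(), 3))
```

The spread of the sample prevalence shrinks roughly with the square root of the sample size, which is why a study of n = 20 can easily be dominated by one or two subtypes purely by chance.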


References

    1. Lai MC, Lombardo MV, Baron-Cohen S. Autism. Lancet. 2014;383:896–910. - PubMed
    2. Buescher AV, Cidav Z, Knapp M, Mandell DS. Costs of autism spectrum disorders in the United Kingdom and the United States. JAMA Pediatr. 2014;168:721–8. - PubMed
    3. Leigh JP, Du J. Brief report: forecasting the economic burden of autism in 2015 and 2025 in the United States. J Autism Dev Disord. 2015;45:4135–9. - PubMed
    4. Kapur S, Phillips AG, Insel TR. Why has it taken so long for biological psychiatry to develop clinical tests and what to do about it? Mol Psychiatry. 2012;17:1174–9. - PubMed
    5. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372:793–5. - PMC - PubMed