Nature. 2020 May;581(7809):444-451.
doi: 10.1038/s41586-020-2287-8. Epub 2020 May 27.

A structural variation reference for medical and population genetics

Ryan L Collins et al. Nature. 2020 May.

Erratum in

  • Author Correction: A structural variation reference for medical and population genetics.
    Collins RL, Brand H, Karczewski KJ, Zhao X, Alföldi J, Francioli LC, Khera AV, Lowther C, Gauthier LD, Wang H, Watts NA, Solomonson M, O'Donnell-Luria A, Baumann A, Munshi R, Walker M, Whelan CW, Huang Y, Brookings T, Sharpe T, Stone MR, Valkanas E, Fu J, Tiao G, Laricchia KM, Ruano-Rubio V, Stevens C, Gupta N, Cusick C, Margolin L; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium; Taylor KD, Lin HJ, Rich SS, Post WS, Chen YI, Rotter JI, Nusbaum C, Philippakis A, Lander E, Gabriel S, Neale BM, Kathiresan S, Daly MJ, Banks E, MacArthur DG, Talkowski ME. Nature. 2021 Feb;590(7846):E55. doi: 10.1038/s41586-020-03176-6. PMID: 33536627. Free PMC article. No abstract available.

Abstract

Structural variants (SVs) rearrange large segments of DNA [1] and can have profound consequences in evolution and human disease [2,3]. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) [4] have become integral in the interpretation of single-nucleotide variants (SNVs) [5]. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage [6]. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings [7]. This SV resource is freely distributed via the gnomAD browser [8] and will have broad utility in population genetics, disease-association studies, and diagnostic screening.

Conflict of interest statement

K.J.K. owns stock in Personalis. A.O’D.-L. has received honoraria from ARUP and Chan Zuckerberg Initiative. B.M.N. is a member of the scientific advisory board at Deep Genomics and consultant for Camp4 Therapeutics, Takeda Pharmaceutical, and Biogen. M.J.D. is a founder of Maze Therapeutics. D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme. M.E.T. has received research support from Levo Therapeutics. S.K. is an employee of Verve Therapeutics, and holds equity in Verve Therapeutics, Maze Therapeutics, Catabasis, and San Therapeutics. He is a member of the scientific advisory boards for Regeneron Genetics Center and Corvidia Therapeutics; he has served as a consultant for Acceleron, Eli Lilly, Novartis, Merck, Novo Nordisk, Novo Ventures, Ionis, Alnylam, Aegerion, Haug Partners, Noble Insights, Leerink Partners, Bayer Healthcare, Illumina, Color Genomics, MedGenome, Quest, and Medscape; he reports patents related to a method of identifying and treating a person having a predisposition to or afflicted with cardiometabolic disease (20180010185) and a genetics risk predictor (20190017119). All other authors declare no competing interests.

Figures

Fig. 1. Properties of SVs across human populations.
a, SV classes catalogued in this study. We also documented unresolved non-reference ‘breakends’ (BNDs), but they were excluded from all analyses as low-quality variants. b, After quality control, we analysed 14,237 samples across continental populations, including African/African American (AFR), Latino (AMR), East Asian (EAS), and European (EUR), or other populations (OTH). Three publicly available WGS-based SV datasets are provided for comparison (1000 Genomes Project (1000G), approximately 7× coverage; Genome of the Netherlands Project (GoNL), around 13× coverage; Genotype-Tissue Expression Project (GTEx), approximately 50× coverage). c, We discovered 433,371 SVs, and provide counts from previous studies for comparison. d, A principal component (PC) analysis of genotypes for 15,395 common SVs separated samples along axes corresponding to genetic ancestry. e, The median genome contained 7,439 SVs. f, Most SVs were small. Expected Alu, SVA and LINE1 mobile element insertion peaks are marked at approximately 300 bp, 2.1 kb and 6 kb, respectively. g, Most SVs were rare (allele frequency (AF) < 1%), and 49.8% of SVs were singletons (solid bars). h, Allele frequencies were inversely correlated with SV size across all 335,470 resolved SVs in unrelated individuals. Values are mean and 95% confidence interval from 100-fold bootstrapping. Colour codes are consistent between a, c, e–h, and between b and d.
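The "mean and 95% confidence interval from 100-fold bootstrapping" reported in several legends can be illustrated with a minimal sketch. The legend does not say which bootstrap interval was used, so the percentile method below is an assumption, and the function name and input values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=100, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for the mean of `values`.

    Assumption: a percentile bootstrap; the source only states
    "100-fold bootstrapping" without naming the interval method.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample mean.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resample means.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return statistics.fmean(values), boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical allele frequencies for SVs in one size bin.
example_afs = [0.001, 0.002, 0.0005, 0.01, 0.003, 0.0001, 0.02, 0.004]
mean_af, ci_low, ci_high = bootstrap_mean_ci(example_afs)
```

With only 100 resamples the interval endpoints are coarse, which is one reason the choice of interval method matters less than the resample count here.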
Fig. 2. Complex SVs are abundant in the human genome.
We resolved 5,295 complex SVs across 11 mutational subclasses, 73.7% of which involved at least one inversion. Each subclass is detailed here, including their mutational signatures, structures, abundance, density of SV sizes (vertical line indicates median size), and allele frequencies. Five pairs of subclasses have been collapsed into single rows due to mirrored or similar alternative allele structures (for example, delINV versus INVdel). Two complex SVs did not conform to any subclass (Extended Data Fig. 8).
Fig. 3. Genome-wide mutational patterns of SVs.
a, Mutation rates (μ) from the Watterson estimator for each SV class. Bars represent 95% confidence intervals. Rates of molecularly validated de novo SVs from 519 quartet families are provided for comparison. b, Smoothed enrichment of SVs per 100-kb window across the average of all autosomes normalized by chromosome arm length (a ‘meta-chromosome’) (Supplementary Fig. 16). c, The distribution of SVs along the meta-chromosome was dependent on variant class. d, SV enrichment by class and chromosomal position provided as mean and 95% confidence intervals (CI). C, centromeric; I, interstitial; T, telomeric. P values were computed using a two-sided t-test and were Bonferroni-adjusted for 21 comparisons. *P ≤ 2.38 × 10⁻³.
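As a rough illustration of the two calculations named in this legend, the sketch below computes the standard Watterson estimator and a Bonferroni threshold. It assumes the textbook form θ_W = S / a_n with a_n = Σ 1/i for i = 1..n−1, and the diploid relation μ = θ_W / (4·N_e); the effective population size and the variant counts are hypothetical inputs, not values taken from the study.

```python
def harmonic_number(k):
    """Sum of 1/i for i = 1..k."""
    return sum(1.0 / i for i in range(1, k + 1))


def watterson_theta(n_segregating_sites, n_chromosomes):
    """Watterson estimator: theta_W = S / a_n, with a_n = H(n - 1)."""
    return n_segregating_sites / harmonic_number(n_chromosomes - 1)


def mutation_rate(theta, effective_population_size):
    """Per-generation rate for diploid autosomes, assuming theta = 4 * N_e * mu."""
    return theta / (4 * effective_population_size)


def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_comparisons


# Hypothetical inputs: 11 variant sites seen in 4 chromosomes, N_e = 10,000.
theta = watterson_theta(11, 4)
mu = mutation_rate(theta, 10_000)

# The threshold quoted in the legend for 21 comparisons at alpha = 0.05.
threshold = bonferroni_threshold(0.05, 21)
```

The Bonferroni helper reproduces the legend's cutoff: 0.05 / 21 is approximately 2.38 × 10⁻³.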
Fig. 4. Pervasive selection against SVs in genes mirrors coding short variants.
a, Four categories of gene-overlapping SVs, with counts of total SVs, median SV size, and mean SVs per gene in gnomAD-SV. b, Count of genes altered by SVs per genome. Horizontal lines indicate medians. Sample sizes per category listed in Supplementary Table 9. c, APS value for SVs overlapping genes. Bars indicate 100-fold bootstrapped 95% confidence intervals. SVs per category listed in Supplementary Table 9. d, Relationships of constraint against pLoF SNVs versus gene-overlapping SVs in 100 bins of around 175 genes each, ranked by SNV constraint. Correlations were assessed with a two-sided Spearman correlation test. Solid lines represent 21-point rolling means. See Supplementary Fig. 19 for comparisons to missense constraint.
Fig. 5. Dosage sensitivity in the noncoding genome.
a, Strength of selection (APS) for noncoding CNVs overlapping 14 categories of noncoding elements (Supplementary Table 5). Bars reflect 95% confidence intervals from 100-fold bootstrapping. Each category was compared to neutral variation (APS = 0) using a one-tailed t-test. Categories surpassing Bonferroni-corrected significance for 32 comparisons are indicated with dark shaded points. SVs per category listed in Supplementary Table 9. DEL, deletion; DUP, duplication; TAD, topologically associating domain; TF, transcription factor. b, CNVs that completely covered elements (‘full’) had significantly higher average APS values than CNVs that only partially covered elements (‘partial’). P values calculated using a two-tailed paired two-sample t-test for the 14 categories from a. c, d, Spearman correlations between sequence conservation and APS for noncoding deletions (n = 143,353) (c) and duplications (n = 30,052) (d). Noncoding CNVs were sorted into 100-percentile bins based on the sum of the phastCons scores overlapped by the CNV. Correlations were assessed with a two-sided Spearman correlation test. Solid lines represent 21-point rolling means.
Fig. 6. gnomAD-SV as a resource for clinical WGS interpretation.
a, Comparison of carrier frequencies for 49 putatively disease-associated deletions (red) and duplications (blue) at genomic disorder loci between gnomAD-SV and microarray analyses in the UK Biobank (UKBB). Light bars indicate binomial 95% confidence intervals. Solid grey line represents linear best fit. b, At least one pLoF or copy-gain SV was detected in 36.9% and 23.7% of all autosomal genes, respectively. ‘Constrained’ and ‘unconstrained’ includes the least and most constrained 15% of all genes based on LOEUF, respectively. c, Carrier rates for very rare (allele frequency < 0.1%) pLoF SVs in medically relevant genes across several gene lists. SVs per category listed in Supplementary Table 9. d, Carrier rates for very large (≥1 Mb) rare autosomal SVs among 12,653 genomes. Bars represent binomial 95% confidence intervals. e, A complex SV involving at least 49 breakpoints and seven chromosomes (also see Extended Data Fig. 8). Teal arrows indicate insertion point into chromosome 1.
Extended Data Fig. 1. Detection of chromosome-scale dosage alterations.
We estimated ploidy (that is, whole-chromosome copy number) for all 24 chromosomes per sample. a, Distribution of autosome ploidy estimates across 14,378 samples passing initial data quality thresholds. White diamonds indicate medians. Individual points are outlier samples at least three standard deviations away from the cohort-wide mean. The outlier points marked in red and blue correspond to the samples highlighted in b–e. b–e, Samples with outlier autosome ploidy estimates typically contained somatic or mosaic chromosomal abnormalities, such as somatic aneuploidy of chromosome 1 (chr1) (b) or chromosome 8 (e), or large focal somatic or mosaic CNVs on chromosome 3 (c) and chromosome 7 (d). Each panel depicts copy-number estimates in 1-Mb bins for each rearranged sample in red or blue. Dark, medium and light-grey background shading indicates the range of copy number estimates for 90%, 99% and 99.9% of all gnomAD-SV samples, respectively, and the medium grey line indicates the median copy number estimate across all samples. Regions of unalignable N-masked bases >1 Mb in the reference genome are masked with grey rectangles. f, Sex chromosome ploidy estimates for all samples from a. We inferred karyotypic sex by clustering samples to their nearest integer ploidy for sex chromosomes. Several abnormal sex chromosome ploidies are marked, including XYY (i), XXY (ii), XXX (iii), and mosaic loss-of-Y (iv). g, Histogram representation of the data from f. Essentially all samples conformed to canonical sex chromosome ploidies.
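The two rules described in this legend (flagging samples at least three standard deviations from the cohort-wide mean, and assigning sex chromosome ploidy to the nearest integer) can be sketched as follows. This is a simplification under stated assumptions: simple rounding stands in for the clustering step, mosaic states such as partial loss of Y are not modelled, and all function names and input values are hypothetical.

```python
import statistics


def ploidy_outliers(estimates, n_sd=3.0):
    """Indices of samples at least n_sd standard deviations from the cohort mean."""
    mean = statistics.fmean(estimates)
    sd = statistics.pstdev(estimates)
    if sd == 0:
        return []  # No spread in the cohort, so nothing can be an outlier.
    return [i for i, x in enumerate(estimates) if abs(x - mean) >= n_sd * sd]


def karyotypic_sex(chrx_ploidy, chry_ploidy):
    """Nearest-integer sex chromosome karyotype, e.g. 'XX', 'XY', 'XXY'.

    Assumption: rounding approximates "clustering to the nearest integer
    ploidy"; mosaic samples would need a separate rule.
    """
    return "X" * round(chrx_ploidy) + "Y" * round(chry_ploidy)


# Hypothetical chromosome 8 ploidy estimates: 99 diploid samples and one trisomy.
cohort = [2.0] * 99 + [3.0]
flagged = ploidy_outliers(cohort)

# Hypothetical sex chromosome estimates for three samples.
calls = [karyotypic_sex(1.02, 0.97), karyotypic_sex(2.01, 0.03), karyotypic_sex(1.0, 1.9)]
```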
Extended Data Fig. 2. Benchmarking the technical qualities of the gnomAD-SV callset.
We evaluated the quality of gnomAD-SV with seven orthogonal analyses detailed in Supplementary Table 4, Supplementary Figs. 6–9 and Supplementary Note 1. Four core analyses are presented here. a, Apparent rates of de novo (that is, spontaneous) heterozygous SVs per child across 970 parent–child trios. Each point is a single trio, and vertical lines denote medians. Given the expected mutation rate of SVs accessible to short-read WGS (<1 true de novo SV per trio; see also Fig. 3a), effectively all de novo SVs represented a combination of false-positive genotypes in children and/or false-negative genotypes in parents. SVs passing all filters and included in the final gnomAD-SV callset (‘pass’) are shown in green. For comparison, variants that did not pass post hoc site-level filters (‘not pass’) are also shown in purple. b, Hardy–Weinberg equilibrium (HWE) metrics for all biallelic SVs localized to autosomes. Deviation from HWE was assessed using a chi-square goodness-of-fit test with one degree of freedom. Vertex labels reflect genotypes: 0/0 denotes homozygous reference; 0/1 denotes heterozygous; and 1/1 denotes homozygous alternate, with all sites shaded by chi-squared P value. c, Linkage disequilibrium between SVs and SNVs or indels for 23,706 common (allele frequency > 1%) SVs represented as cross-population maximum R² values after excluding repetitive and low-complexity regions (see Supplementary Fig. 7). Points and vertical bars represent medians and interquartile ranges, respectively. d, Correlation of allele frequency (AF) for 37,907 common SVs captured by both the 1000 Genomes Project and gnomAD-SV. Pearson’s correlation coefficient (R²) is provided.
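The Hardy–Weinberg test named in panel b can be illustrated with a minimal sketch of the standard chi-square goodness-of-fit calculation for a biallelic site. It is not the study's code; the function name and genotype counts are hypothetical. The P value uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).

```python
import math


def hwe_chi_square(n_ref, n_het, n_alt):
    """Chi-square goodness-of-fit test for HWE at a biallelic site (1 d.f.).

    Arguments are counts of 0/0, 0/1 and 1/1 genotypes.
    Returns (chi2, p_value).
    """
    n = n_ref + n_het + n_alt
    if n == 0:
        raise ValueError("at least one genotyped sample is required")
    p = (2 * n_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                        # alternate allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref, n_het, n_alt)
    # Skip cells with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts: one site in equilibrium, one with no heterozygotes.
in_equilibrium = hwe_chi_square(25, 50, 25)
het_deficit = hwe_chi_square(50, 0, 50)
```

A site with a complete deficit of heterozygotes, as in the second call, is the kind of pattern that would sit far from the equilibrium curve in a ternary plot of genotype proportions.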
Extended Data Fig. 3. In silico confirmation of SVs in gnomAD-SV with long-read WGS.
We used Pacific Biosciences (PacBio) long-read WGS data available for four samples in this study to perform in silico confirmation to estimate the positive predictive value and breakpoint accuracy for SVs in gnomAD-SV (Supplementary Fig. 10). a, Counts of SVs evaluated per sample in this analysis. SVs were restricted to those that had breakpoint-level read support (that is, ‘split-read’ evidence, 92.8% of all SVs) and did not have breakpoints localized to annotated simple repeats or segmental duplications. b, An iterative local long-read WGS realignment algorithm, VaPoR, was used to perform in silico confirmation of SVs predicted from short-read WGS in gnomAD-SV. As noted by the VaPoR developers, the performance of this approach was sensitive to the sequencing depth of long-read WGS data. Therefore, the weighted mean of the four samples was used as a study-wide long-read WGS confirmation rate, weighting the confirmation rate of each sample based on the square root of its long-read WGS sequencing depth. c, Confirmation rates stratified by SV class, size and allele frequency. A mean of 4,829 SVs per sample was assessed. Horizontal green bars denote weighted means.
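The weighting scheme described in panel b, in which each sample's confirmation rate is weighted by the square root of its long-read sequencing depth, amounts to a short calculation. The sketch below uses hypothetical rates and depths for illustration only; the function name is not from the study.

```python
import math


def depth_weighted_confirmation_rate(rates, depths):
    """Weighted mean of per-sample confirmation rates.

    Each weight is the square root of that sample's long-read sequencing depth.
    """
    if len(rates) != len(depths) or not rates:
        raise ValueError("rates and depths must be non-empty and the same length")
    weights = [math.sqrt(d) for d in depths]
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


# Hypothetical per-sample confirmation rates and long-read depths (fold coverage).
example_rates = [0.95, 0.93, 0.90, 0.88]
example_depths = [64, 49, 25, 16]
study_wide_rate = depth_weighted_confirmation_rate(example_rates, example_depths)
```

Taking the square root dampens the influence of the deepest samples relative to weighting by depth directly, while still giving them more say than an unweighted mean would.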
Extended Data Fig. 4. SVs contribute a substantial burden of rare, homozygous, and coding mutations per genome.
a–d, Counts of SVs per genome across a variety of parameters, corresponding to median counts of total SVs (a), homozygous SVs (b), rare SVs (c) and singleton SVs (d). Samples are grouped by population and coloured by SV types. The solid bar to the left of each population indicates the population median. e–g, Median counts of genes disrupted by SVs per genome when considering all SVs (including MCNVs) (e), homozygous SVs (including MCNVs) (f), and rare SVs (g). Colours correspond to predicted functional consequence. h, Counts of pLoF SVs per genome. For certain categories, such as genes disrupted by rare SVs per genome, a subset of samples (<5%) were enriched above the population average, as expected for individuals carrying large, rare CNVs predicted to cause the disruption of dozens or hundreds of genes (see Extended Data Fig. 1); for the purposes of visualization, the y axis for all panels has been restricted to a maximum of three interquartile ranges above the third quartile across all samples for each category.
Extended Data Fig. 5. Rearrangement size is a primary determinant of allele frequency for most classes of SVs.
a, Proportion of singleton SVs in five SV size bins for each class of biallelic SVs considered in this study. Intergenic SVs (light colours; n = 206,954) exhibited reduced singleton proportions when compared to all SVs (dark colours; n = 335,470) of the same size and class. Bars reflect 95% confidence intervals from 100-fold bootstrapping. Categories with fewer than ten SVs are not shown. b, To account for the strong dependency of singleton proportion on SV size and class, we developed the APS metric, which normalizes singleton proportions using SV-specific technical and genomic covariates to permit comparisons of the frequency spectra across SV classes (see Supplementary Fig. 14). The same data as in a are shown, transformed onto the APS scale, which shows effectively no dependency on SV size for intergenic SVs. Bars reflect 95% confidence intervals from 100-fold bootstrapping. Residual deviation from APS = 0 is maintained when considering all SVs, owing to APS being intentionally calibrated to intergenic SVs as a proxy for neutral variation. Because larger SVs are more likely to be gene-disruptive, they upwardly bias the APS point estimates due to residual negative selection not captured by SV size alone. Counts of SVs per category for both a and b are listed in Supplementary Table 9.
Extended Data Fig. 6. Most SVs within genes appear under negative selection.
a, Enrichments for pLoF consequences among rare and singleton SVs across SV classes. b, Enrichments for non-pLoF functional consequences among rare and singleton SVs across SV classes. c, Adjusted proportion of singletons across SV types and functional consequences. d, APS among deletions relative to count of exons and whole genes deleted. e, Fractions of all autosomal protein-coding genes with at least one SV across a variety of functional consequences. f, Relationship of APS and constraint against pLoF SNVs. For this analysis, intronic, promoter and UTR SVs were required to have precise breakpoints (that is, have ‘split-read’ support) to protect against any cryptic overlap with coding sequence unable to be annotated due to imprecise breakpoints. For c, d and f, points and vertical bars represent point estimates and 95% confidence intervals from 100-fold bootstrapping, respectively. Counts of SVs per category in c and d are provided in Supplementary Table 9. For d and f, deletions in highly repetitive or low-complexity sequence (≥30% coverage by annotated segmental duplications or simple repeats) were excluded.
Extended Data Fig. 7. gnomAD-SV can augment disease association studies.
a, Functional enrichments of 2,307 common SVs in strong linkage disequilibrium (R² ≥ 0.8) with an SNV associated with a trait or disease in the GWAS catalogue or the UK Biobank. Points represent odds ratios of SVs being in strong linkage disequilibrium with at least one GWAS-significant SNV among all SVs in strong linkage disequilibrium with at least one SNV (total n = 15,634 SVs). Single and triple asterisks correspond to nominal (P < 0.05) and Bonferroni-corrected (P < 0.0083) significance thresholds from a two-sided Fisher’s exact test, respectively. Bars represent 95% confidence intervals. Test statistics, SV counts, and P values are provided in Supplementary Table 6. b, Example locus at 16q22.1, where we identified a 336-bp deletion in strong linkage disequilibrium with SNVs significantly associated with hypothyroidism in the UK Biobank. Top, the GWAS signal among genotyped SNVs in the UK Biobank, coloured by strength of linkage disequilibrium (Pearson’s R² value) with the 336-bp deletion identified in gnomAD-SV. Bottom, the local genomic context of this deletion, which overlaps an annotated intronic Alu element near (<1 kb) the first exon of a highly constrained, thyroid-expressed gene, ATP6V0D1. The deletion lies amidst histone mark peaks commonly found at active enhancers (H3K27ac and H3K4me1) based on publicly available chromatin data from adult thyroid samples, a phenotype-relevant tissue. Human Alu elements are known to frequently act as enhancers, and the sentinel hypothyroidism SNV from the UK Biobank GWAS is a significant expression-modifying variant (that is, eQTL) for ATP6V0D1 and other nearby genes across many tissues, which indicates that the hypothyroidism risk haplotype modifies expression of ATP6V0D1 and/or other genes, potentially through the deletion of an intronic enhancer.
Extended Data Fig. 8. An extremely complex SV involving 49 breakpoints and seven chromosomes.
A highly complex insertion rearrangement from gnomAD-SV in which 47 segments from six different chromosomes were duplicated and inserted into a single locus on chromosome 1, forming a 626,065 bp stretch of contiguous inserted sequence composed of shattered fragments. Given the involvement of multiple chromosomes, the signature of localized shattering, and the clustered breakpoints, we note that this rearrangement has several hallmarks of germline chromothripsis, which has been observed in healthy adults previously, albeit rarely. However, unlike previous reports of germline chromothripsis, there are no apparent whole-chromosome translocations, and all segments were duplicated before being inserted in a compound manner into chromosome 1, potentially suggesting a replication-based repair mechanism. The exact origin of this rearrangement is unclear. a, Circos representation of all 49 breakpoints and seven chromosomes involved in this SV. Teal arrows indicate insertion point into chromosome 1. b, The median segment size was 8.4 kb. c, Linear representation of the rearranged inserted sequence. Colours correspond to chromosome of origin, and arrows indicate strandedness of the inserted sequence, relative to the GRCh37 reference.

References

    1. Sudmant PH, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81.
    2. Perry GH, et al. Copy number variation and evolution in humans and chimpanzees. Genome Res. 2008;18:1698–1710.
    3. Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat. Rev. Genet. 2013;14:125–138.
    4. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020. doi: 10.1038/s41586-020-2308-7.
    5. Walsh R, et al. Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples. Genet. Med. 2017;19:192–203.