Cell. 2022 Sep 1;185(18):3426-3440.e19.
doi: 10.1016/j.cell.2022.08.004.

High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios

Marta Byrska-Bishop et al. Cell. 2022.

Abstract

The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.

Keywords: 1000 Genomes Project; INDEL; SNV; population genetics; reference imputation panel; structural variation; trio sequencing; whole-genome sequencing.

Conflict of interest statement

Declaration of interests: E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. P.F. is an SAB member of Fabric Genomics, Inc., and Eagle Genomics, Ltd.

Figures

Graphical abstract
Figure 1
SNV/INDEL discovery in the high-coverage WGS data across the 3,202 1kGP samples. (A) Counts of samples stratified by sex and super-population. Original: 2,504 original 1kGP samples. New: 698 newly added samples. (B) Cohort-level alternate allele counts of SNVs and INDELs across the 3,202 samples, stratified by AF bins. Novel/known: sites absent from/present in dbSNP build 155. AF was estimated based on the 2,504 unrelated samples. Pie chart: breakdown of all novel variants by the super-population ancestry. Gray area in the pie chart: novel sites that were called in more than one super-population. (C) Count of small variant loci per genome, stratified by population. See also Figures S1A–S1C. (D) Predicted functional SNVs and INDELs (autosomes). Top row: cohort-level counts (purple bar plot) overlaid with distributions of sample-level counts (boxplots) across the 2,504 unrelated samples. Middle row: fraction of rare (MAF ≤1%) SNVs and INDELs among the predicted functional sites. Bottom row: fraction of novel SNVs and INDELs among the predicted functional sites. See also Figures S1G and S1H. (E) Precision versus recall computed relative to the GIAB truth set v3.3.2, stratified by easy and difficult regions of the genome. See also Figure S1D. Super-population ancestry labels: EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, American. Descriptions of population labels are in Table S1.
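The precision, recall, and FDR values reported in the benchmarking panels can be related to true positive (TP), false positive (FP), and false negative (FN) counts using the standard definitions. The following is a minimal Python sketch of those definitions; the function name and the example counts are illustrative assumptions and are not taken from the paper or its pipeline.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Standard precision, recall, and FDR from TP, FP, and FN counts."""
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    called = tp + fp  # all calls made in the evaluated call set
    truth = tp + fn   # all variants present in the truth set
    precision = tp / called if called else float("nan")
    recall = tp / truth if truth else float("nan")
    fdr = fp / called if called else float("nan")  # equals 1 - precision
    return {"precision": precision, "recall": recall, "fdr": fdr}


# Illustrative counts only; these are not values from the paper.
metrics = benchmark_metrics(tp=90, fp=10, fn=30)
```

With these made-up counts, 90 of 100 calls are correct and 90 of 120 truth variants are recovered.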
Figure S1
Evaluation of small variant calls, related to Figure 1. Sample-level counts of SNVs (A) and INDELs (B), stratified by super-population. (C) Sample-level Het/Hom ratios across small variants, stratified by super-population. (D) Counts of true positive (TP), false positive (FP), and false negative (FN) SNV and INDEL calls in easy and difficult regions of the genome (GIAB v3.3.2 high confidence regions only). (E) Sample-level singleton (sites with AC = 1 across 3,202 samples) counts, stratified by relatedness status. (F) Counts of true positive (TP) and false positive (FP) singletons in NA12878 relative to either the GIAB v3.3.2 or GIAB v4.2.1 truth set (GIAB high confidence regions only). Due to the presence of NA12878’s parental samples in the expanded cohort, the analysis using the 3,202-sample 1kGP call set is based on both de novo and inherited variants private to the NA12878 trio. (G) Sample-level counts of predicted functional small variants, stratified by super-population. Reported counts are across the 2,504 unrelated samples only. (H) Distributions of log2(ratios) of sample-level counts from (G) normalized by the mean count across the 2,504 unrelated samples. Super-population ancestry labels: European (EUR), African (AFR), East Asian (EAS), South Asian (SAS), American (AMR). Descriptions of population labels are in Table S1. Panels E, G, and H are based on autosomes.
Figure S2
Ploidy of each chromosome across the 3,202 samples, related to Figure 1. (A) Ploidy of allosomes. (B) Copy number (CN) of each chromosome. Each dot represents the copy number of a 1 Mbp bin in a sample. Blue dots represent samples with copy gain and red dots represent copy loss.
Figure S3
Benchmark of GATK-SV, svtools, and Absinthe, related to Figure 2. (A) Overlap of insertion sites between the GATK-SV and Absinthe call sets. (B) Overlap of SVs other than insertions between the GATK-SV and svtools call sets. (C) Overlap of SV sites of each type between GATK-SV, svtools, and Absinthe. (D) Overlap of insertions in each genome between GATK-SV and Absinthe. (E-G) Overlap of deletions (E), duplications (F), and inversions and complex SVs (G) in each genome between GATK-SV and svtools. The integers in (D-G) represent the count of SVs per sample, followed by proportion of SVs validated by VaPoR/proportion of SVs assessable by VaPoR in the second row, proportion of SVs supported by PacBio SVs in Ebert et al. (2021)/proportion of SVs supported by PacBio SVs in Chaisson et al. (2019) in the third row, and transmission rate/rate of biparentally inherited SVs in the fourth row. (H-I) Precision of the insertion breakpoint (H) and length (I) assessed against PacBio assemblies. (J-K) Precision of the SV breakpoints in the GATK-SV (J) and svtools (K) call sets assessed against PacBio assemblies. (L) Breakpoint distance of SVs shared by GATK-SV and svtools. (M-N) De novo rate of SVs in the GATK-SV (M) and svtools (N) call sets when filtered at different boost score cutoffs. (O) False positives and false negatives in the GATK-SV and svtools call sets when filtered at different boost score cutoffs.
Figure S4
Comparison of small variant calls to the phase 3 call set, related to Figure 3. (A) Length of INDELs in the high-coverage as compared to the phase 3 call sets. (B) Number of true positive (TP), false positive (FP), and false negative (FN) SNVs and INDELs in the high-coverage vs. phase 3 call set, stratified by easy and difficult regions of the genome (GIAB v3.3.2 high confidence regions only). (C) Comparison of allele frequencies in the high-coverage vs. the phase 3 call set across shared loci, stratified by variant type and regions of the genome. r: Pearson correlation coefficient. (D and E) Number of false positive (FP), true positive (TP), and unassessed (NA; sites outside of the GIAB v3.3.2 high confidence regions of the genome) predicted functional SNVs (D) and INDELs (E) in sample NA12878, defined based on the comparison against the GIAB NA12878 truth set v3.3.2. There were no stop-loss INDELs in sample NA12878, hence no plot for that category in (E). See also Figures 3G and 3H (bottom row). Panels A, C, D, E: chr1-22; panel B: chr1-22 and X.
Figure 2
SV discovery in the high-coverage WGS data across the 3,202 1kGP samples. (A–C) The count (A), size distribution (B), and allele frequency distribution (C) of each SV class. (D–F) The mean per sample count of SVs by variant class (D) and ancestral population (E) is also provided, as well as inheritance and transmission rates (F) of all SVs. In (F), child inheritance rate refers to the proportion of SVs in a child inherited from the parents. Parental transmission rate refers to the proportion of SVs in parents’ genomes that are transmitted; displayed here are all informative SVs that are heterozygous in only one parental genome. Vertical colored lines in each row represent the mean value, whereas numbers on the right margin represent median SV counts across the children or families. SV Classes: DEL, deletion; DUP, duplication; mCNV, multiallelic copy number variant; INS, insertion; INV, inversion; CPX, complex SV; CTX, inter-chromosomal translocation. Super-population ancestry labels: EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, American. Descriptions of population labels are in Table S1. See also Figure S3.
Figure 3
Comparison of small variant calls to the phase 3 call set. (A and B) Number of SNVs (A) and INDELs (B) across the 2,504 samples in phase 3 and high-coverage datasets, stratified by AF bins and regions of the genome. Secondary y axis: % of autosomal phase 3 variants recalled in the high-coverage call set across SNVs (A) and INDELs (B) in easy and difficult regions of the genome. See also Figure S4C. (C and D) Comparison of FDR across SNVs (C) and INDELs (D) between the high-coverage and phase 3 call sets, stratified by AF bins and regions of the genome. See also Figure S4B. (E and F) Sample-level SNV (E) and INDEL (F) counts in the phase 3 versus high-coverage call sets, stratified by 1kGP super-population ancestry. EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, American. Reported counts are at a locus level. (G and H) Comparison of predicted functional SNV (G) and INDEL (H) counts in the high-coverage versus phase 3 call set. Log2(ratio) denotes ratio of variant counts in the high-coverage versus phase 3 call set. Top row: cohort-level comparison. Middle row: sample-level comparison. Bottom row: comparison of FDR. Red asterisks mark categories with fewer than 100 sites in sample NA12878 (i.e., categories where FDR estimation is less reliable). See also Figures S4D and S4E. FDR in (C), (D), (G), and (H) was estimated based on comparison of calls in sample NA12878 to the GIAB truth set v3.3.2. (A), (B), and (E–H): chromosomes (chr) 1–22; (C) and (D): chr1–22 and X.
Figure 4
Comparison of the ensemble SV calls to the phase 3 call set. (A) Count of SV sites in the current ensemble SV call set and phase 3 SV call set and their overlap. Numbers next to each bar represent the counts of SV sites in each dataset. (B) The distribution of SV counts per sample in both call sets and their average overlap, displayed in the Venn diagram. (C) Count of genes altered by SVs in both datasets. pLoF, predicted loss of function; CG, complete copy gain; IED, intragenic exon duplication. (D) Count of genes altered by SVs across ancestral populations. See also Figure S5.
Figure S5
Comparison of gene interruptive SVs in the high-coverage ensemble versus phase 3 1kGP call sets, related to Figure 4. Count of genes affected by SVs as (A) predicted loss of function (pLoF), (B) intragenic exon duplication (IED), and (C) complete copy gain (CG) in the high-coverage ensemble call set and the 1kGP phase 3 SV call set. Super-population ancestry labels: European (EUR), African (AFR), East Asian (EAS), South Asian (SAS), American (AMR).
Figure 5
Small variant phasing and imputation performance. (A) Counts of small variants passing specified filtering criteria (chr1–22 and X; top 10 combinations of filtering criteria in terms of variant counts are shown). PASS, sites that passed VQSR; Miss., genotype missingness; HWE, Hardy-Weinberg Equilibrium exact test p value > 1e-10 in at least one of the five 1kGP super-populations; ME, Mendelian error rate across complete trios; MAC, minor allele count. See also Table S6. (B) Haplotype phasing accuracy of the high-coverage and the phase 3 1kGP panel. SER, switch error rate relative to the Platinum Genome truth set. Two additional phasing conditions (dashed lines) are shown for the high-coverage panel for evaluation purposes only: (1) diamonds: SER obtained when phasing NA12878 without parents included in the cohort. (2) Triangles: SER obtained when phasing NA12878 with parents included but without the pedigree-based correction (duohmm) applied. See also Figures S6A and S6B. (C) Haplotype phasing accuracy of the high-coverage panel, stratified by relationship status. SER was computed relative to the HGSVC SNV call set (Ebert et al., 2021). See also Figure S6C. (D) Imputation accuracy of SNV and INDEL genotypes imputed using the high-coverage panel, stratified by genomic regions. Mean r2, squared Pearson correlation coefficient averaged over 110 SGDP samples. See also Figures S6D–S6G. (E) Comparison of the imputation accuracy between the high-coverage and phase 3 panels for SNVs and INDELs, stratified by super-population ancestry. EUR, European; AFR, African; EAS, East Asian; SAS, South Asian; AMR, American. The comparison was restricted to sites that are shared between the two panels. (B–E) are based on autosomes.
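Switch error rate (SER) is commonly computed by walking along consecutive heterozygous sites and counting how often the inferred phase flips relative to the truth. The sketch below shows that common definition in Python; the function name, the 0/1 encoding of phase, and the example vectors are assumptions made for illustration, and the paper's exact evaluation procedure is not described in this caption.

```python
def switch_error_rate(truth, inferred):
    """Fraction of adjacent heterozygous site pairs with a phase switch.

    truth, inferred: equal-length sequences of 0/1, one entry per
    heterozygous site, giving the haplotype that carries the alternate
    allele (hypothetical encoding chosen for this sketch).
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must have the same length")
    if len(truth) < 2:
        raise ValueError("need at least two heterozygous sites")
    # True where the inferred orientation matches the truth orientation.
    agree = [t == p for t, p in zip(truth, inferred)]
    # A switch occurs whenever agreement changes between neighboring sites.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)


# Illustrative data only: one switch between the second and third site.
ser = switch_error_rate([0, 1, 0, 1, 0, 1], [0, 1, 1, 0, 1, 0])
```

A fully inverted haplotype labeling yields no switches under this definition, because only relative orientation between neighboring sites matters.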
Figure 6
SV phasing and imputation performance. (A) Cohort-level counts of filtered SVs included in the integrated haplotype-resolved panel, stratified by the SV type (chr1–22 and X). (B) Distribution of sample-level flip rate of phased HET DELs and INSs that were evaluated for phasing accuracy against the HGSVC truth set. (C) Distribution of sample-level parental flip rate of phased HET SVs, stratified by SV type. (D) SV imputation performance of the high-coverage panel in the SGDP study dataset, stratified by SV type. Mean r2, squared Pearson correlation coefficient between imputed allelic dosages and dosages from the SV “truth set,” averaged over the 110 SGDP samples (except for the AF = 0.5% bin: 100 and 92 samples for INSs and DELs, respectively). (E) Counts of SVs imputed in the SGDP study dataset using the high-coverage reference panel at info >0.4 (left) and info >0.8 (right) across three MAF bins (MAF based on 110 imputed SGDP samples). (B–E) are based on autosomes. SV types: DEL, deletions; INS, insertions; DUP, duplications; INV, inversions.
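The r2 metric named in panel (D) is a squared Pearson correlation between imputed allelic dosages and truth dosages. A minimal Python sketch of that quantity follows; the function name and the dosage vectors are illustrative assumptions, and how sites and samples are grouped before averaging is not specified here.

```python
def squared_pearson(imputed, truth):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return (sxy * sxy) / (sxx * syy)


# Illustrative dosages only (0 to 2 copies of the alternate allele).
r2_perfect = squared_pearson([0, 1, 2, 1], [0, 1, 2, 1])
r2_noisy = squared_pearson([0.1, 0.9, 1.8, 1.2], [0, 1, 2, 1])
```

Returning NaN for constant vectors makes monomorphic sites explicit rather than silently scoring them as 0 or 1.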
Figure S6
SNV/INDEL phasing and imputation performance, related to Figure 5. SER: switch error rate stratified by (A) chromosome and (B) variant type. Note: SER on chr21 in the 0.1–1% MAF bin is equal to 0 (i.e., no switch errors found). This is a fluctuation due to low variant counts per MAF bin in sample NA12878 as chromosomes get smaller. Chromosome X is shown separately in (B) as it was phased using a different strategy than autosomes (statistical phasing vs. statistical phasing with pedigree-based correction, respectively). (C) Impact of inclusion of trios on the phasing accuracy of the 1kGP high-coverage call set, stratified by relationship status in the 3,202-sample cohort. log10(SER ratio) refers to the ratio of SER in the phasing run including trios (n = 3,202 samples) vs. phasing run without trios (n = 2,504 samples), computed relative to the HGSVC truth set (1 child, 5 parents, 9 unrelated samples). Imputation accuracy of the high-coverage panel stratified by super-population for SNVs (D, E) and INDELs (F, G) in easy and difficult regions of the genome. Imputation accuracy was estimated as described in Figure 5D. (H-L) Imputation accuracy of the high-coverage panel for each of the five super-populations, stratified by the population. (M) Genotype discordance rates for SNVs and INDELs imputed using the high-coverage and phase 3 panels stratified by super-population. (N) Counts of SNVs and INDELs imputed in the SGDP study dataset using the high-coverage vs. the phase 3 reference panel at info >0.4 (left) and info >0.8 (right) across three MAF bins (MAF based on the 110 imputed SGDP samples). Panels C-N are based on autosomes.
Figure S7
SV phasing and imputation performance, related to Figure 6. (A) Distribution of sample-level fractions of HET SVs (DELs and INSs) that were assessed for phasing accuracy against the HGSVC truth set in Figure 6B. (B) Distribution of sample-level fractions of HET SVs (DELs, INSs, DUPs, INVs) that were assessed for phasing accuracy using parental flip rate as shown in Figure 6C. (C) Fraction of SV sites (DELs and INSs; out of all DELs and INSs included in the high-coverage panel) that was included in the imputation performance evaluation against the HGSVC truth set shown in Figure 6D. (D) Upset plot showing site-level overlap of DELs and INSs discovered in the high-coverage 1kGP call set with those discovered in the long-read-based HGSVC call set used as the truth set. Overlap criteria: breakpoint position within ±50 bp from the start site in the 1kGP call set and 80% length overlap. SV types: DEL, deletions; INS, insertions; DUP, duplications; INV, inversions.
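The overlap criteria in panel (D) can be illustrated with a small matching predicate. The Python sketch below is one possible reading, not the authors' implementation: it assumes that "80% length overlap" means the shorter SV length is at least 80% of the longer one, and the function and parameter names are hypothetical.

```python
def sv_sites_match(start_1kgp, length_1kgp, start_truth, length_truth,
                   max_start_distance=50, min_length_ratio=0.8):
    """Return True if two SV sites match under the assumed criteria.

    Assumption: a start position within max_start_distance bp of the
    1kGP start site, and a shorter-to-longer length ratio of at least
    min_length_ratio.
    """
    if length_1kgp <= 0 or length_truth <= 0:
        raise ValueError("SV lengths must be positive")
    starts_close = abs(start_1kgp - start_truth) <= max_start_distance
    length_ratio = min(length_1kgp, length_truth) / max(length_1kgp, length_truth)
    return starts_close and length_ratio >= min_length_ratio


# Illustrative coordinates only; these are not sites from either call set.
close_and_similar = sv_sites_match(1000, 500, 1030, 450)
too_far = sv_sites_match(1000, 500, 1100, 500)
too_different = sv_sites_match(1000, 500, 1000, 300)
```

A length-ratio test applies equally to insertions, which have no reference span to intersect; a reciprocal-overlap test on reference intervals would be an alternative reading for deletions.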
