Nat Commun. 2021 May 25;12(1):3152.
doi: 10.1038/s41467-021-22889-4.

Identification of putative causal loci in whole-genome sequencing data via knockoff statistics

Zihuai He et al. Nat Commun. 2021.

Abstract

The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of KnockoffScreen.
a Knockoff generation based on the original genotype matrix. Each row in the matrix corresponds to an individual and each column corresponds to a genetic variant. Each cell presents the genotype value/dosage. b Calculation of the importance score for each 1 bp, 1 kb, 5 kb, or 10 kb window. c Example of genome-wide screening results using conventional association testing (top) and KnockoffScreen (bottom).
Fig. 2. Power and false discovery rate (FDR) simulation studies in a single region.
The four panels show power and FDR based on 500 replicates for different types of traits (quantitative and dichotomous) and different types of variants (rare and common), with the target FDR varying from 0 to 0.2. The different colors indicate different knockoff generators. The different line types indicate different tests used to define the importance score. Source data are provided as a Source Data file.
Fig. 3. Genome-wide power and false discovery rate (FDR) simulation studies in the presence of multiple causal loci.
a, c Empirical power for different types of traits (quantitative and dichotomous), defined as the average proportion of 200 kb causal loci being identified at target FDR 0.1. b, d Empirical FDR for different types of traits (quantitative and dichotomous) at different resolutions, defined as the proportion of significant windows (target FDR 0.1) ± 100/75/50 kb away from the causal windows. The empirical power and FDR are averaged over 100 replicates. Source data are provided as a Source Data file.
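The definitions of empirical power and FDR in this caption can be illustrated with a minimal Python sketch. It is not the authors' simulation code. It assumes windows are summarized by base-pair positions on a single chromosome, that a causal locus counts as identified when at least one selected window lies within `radius` of it, and that at least one causal position is supplied per replicate. The function names `replicate_metrics` and `empirical_power_fdr` are hypothetical.

```python
import numpy as np

def replicate_metrics(selected_pos, causal_pos, radius=100_000):
    """Return (power, false discovery proportion) for one simulated replicate.

    Assumption: positions are base-pair coordinates on one chromosome,
    and causal_pos is non-empty.
    """
    selected = np.asarray(selected_pos, dtype=float)
    causal = np.asarray(causal_pos, dtype=float)
    if selected.size == 0:
        return 0.0, 0.0  # nothing selected: zero power, FDP is 0 by convention
    # dist[i, j] = distance from selected window i to causal window j
    dist = np.abs(selected[:, None] - causal[None, :])
    # Share of causal loci with at least one selection within the radius
    power = float(np.mean(dist.min(axis=0) <= radius))
    # Share of selections farther than the radius from every causal locus
    fdp = float(np.mean(dist.min(axis=1) > radius))
    return power, fdp

def empirical_power_fdr(replicates, radius=100_000):
    """Average power and FDP over an iterable of (selected_pos, causal_pos)."""
    metrics = [replicate_metrics(s, c, radius) for s, c in replicates]
    power, fdr = np.mean(metrics, axis=0)
    return float(power), float(fdr)
```

Changing `radius` to 75,000 or 50,000 corresponds to the stricter resolutions mentioned in the caption.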
Fig. 4. KnockoffScreen prioritizes causal variants/loci and distinguishes the signal due to rare variants from shadow effects of significant common variants nearby.
a, b Results of the data analyses of the APOE ± 100 kb region from the ADSP data. Each dot represents a window. Windows selected by KnockoffScreen are highlighted in red. Windows selected by conventional association testing but not by KnockoffScreen are shown in gray. c–e Simulation results based on the APOE ± 100 kb region, comparing the conventional association testing and KnockoffScreen methods in terms of (c) the frequency of selected variants/windows overlapping with the causal region; (d) the maximum distance of selected variants/windows to the causal region; and (e) the number of false positives due to shadow effects. The target FDR is 0.1. The density plots are based on 500 replicates. Source data are provided as a Source Data file.
Fig. 5. Empirical evaluation of KnockoffScreen in the presence of population stratification.
a Principal component analysis of the ADSP data, which contains three ethnic groups: African American (AA), Non-Hispanic White (NHW), and Others (of which 98% are Caribbean Hispanic). Each dot represents an individual. b, c Simulation results for FDR control in the presence of population stratification that mimics the ADSP data, comparing KnockoffScreen with conventional association testing. Each panel shows empirical FDR based on 500 replicates. KnockoffScreen 10PCs is a modified version of the KnockoffScreen method that includes adjustment for the top principal components while computing the association statistics (p-values). KnockoffScreen controls FDR at 0.10; Association Testing is based on the usual Bonferroni correction (0.05/number of tests), controlling FWER at 0.05. Source data are provided as a Source Data file.
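The Bonferroni rule quoted in this caption (0.05/number of tests) can be written out in a few lines. This is a generic sketch of the textbook correction, not code from the paper; the function name `bonferroni_select` is hypothetical.

```python
import numpy as np

def bonferroni_select(p_values, alpha=0.05):
    """Return (indices of significant tests, per-test threshold).

    A test is declared significant when p < alpha / number_of_tests,
    which controls the family-wise error rate (FWER) at alpha.
    """
    p = np.asarray(p_values, dtype=float)
    threshold = alpha / p.size
    return np.flatnonzero(p < threshold), threshold
```

With four tests the per-test threshold is 0.05/4 = 0.0125, so p-values of 1e-9, 0.01, and 0.002 would pass while 0.5 would not.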
Fig. 6. KnockoffScreen application to the Alzheimer’s Disease Sequencing Project (ADSP) data to identify variants associated with Alzheimer’s disease.
a Manhattan plot of p-values (truncated at 10^-20 for clear visualization) from the conventional association testing with Bonferroni adjustment (p < 0.05/number of tested windows) for FWER control. b Manhattan plot of KnockoffScreen with target FDR at 0.1. c Heatmap that shows stratified p-values (truncated at 10^-10 for clear visualization) of all loci passing the FDR = 0.1 threshold, and the corresponding Q-values that already incorporate correction for multiple testing. The loci are shown in descending order of the knockoff statistics. For each locus, the p-values of the top associated single variant and/or window are shown, indicating whether the signal comes from a single variant, a combined effect of common variants, or a combined effect of rare variants. The names of genes previously implicated by GWAS studies are shown in bold (names are used only to label the region and may not represent the causative gene in the region). Source data are provided as a Source Data file.
Fig. 7. KnockoffScreen application to the COPDGene study in TOPMed to identify variants associated with FEV1 in Non-Hispanic White (NHW) individuals.
a Manhattan plot of p-values from the conventional association testing with Bonferroni adjustment (p < 0.05/number of tested windows) for FWER control. b Manhattan plot of KnockoffScreen with target FDR at 0.1. c Heatmap that shows stratified p-values of all loci passing the FDR = 0.1 threshold, and the corresponding Q-values that already incorporate correction for multiple testing. The loci are shown in descending order of the knockoff statistics. For each locus, the p-values of the top associated single variant and/or window are shown, indicating whether the signal comes from a single variant, a combined effect of common variants, or a combined effect of rare variants. The names of genes previously implicated by GWAS studies are shown in bold (names are used only to label the region and may not represent the causative gene in the region). Source data are provided as a Source Data file.
Fig. 8. Scatter plot of genome-wide W statistic vs. −log10 (p-value).
Each dot represents one variant/window. The dashed lines show the significance thresholds defined by Bonferroni correction (for p-values) and by false discovery rate (FDR; for W statistic). The p-values are from the conventional association testing described in the main text. Source data are provided as a Source Data file.
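To make the FDR-based threshold on the W statistic concrete, the following Python sketch implements the standard single-knockoff filter threshold (often called knockoff+) from the general knockoff literature. It is an illustration under that assumption only: the paper integrates multiple knockoffs, which this sketch does not reproduce, and the function names `knockoff_threshold` and `knockoff_select` are hypothetical.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t > 0 with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the knockoff+ threshold. Returns inf if no t qualifies.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes of W, in ascending order
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Negative W values act as a proxy count of false positives above t
        est_fdp = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if est_fdp <= q:
            return float(t)
    return np.inf

def knockoff_select(W, q=0.1):
    """Indices of features whose W statistic meets the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q))
```

Because of the offset of 1 in the numerator, this version can select features only when at least 1/q of them clear the threshold, so with q = 0.1 a handful of large W values on their own yields no selections.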
Fig. 9. Simulation studies to evaluate the stability and reproducibility of different knockoff procedures.
Different colors indicate different knockoff procedures: KnockoffScreen, single knockoff and MK – Maximum (the multiple knockoff method based on the maximum statistic proposed by Gimenez and Zou). All three methods are based on the same knockoff generator proposed in this paper for a fair comparison. The stability (a, c) is quantified as the variation of $\tau_{\Phi_{kl}}$ across 100 replicates due to randomly sampling knockoffs for a given dataset (left and right panels). The reproducibility (b) is quantified as the frequency of a causal window being selected across 100 replicates. Source data are provided as a Source Data file.


