Nat Commun. 2021 May 25;12(1):3152.

doi: 10.1038/s41467-021-22889-4.

Identification of putative causal loci in whole-genome sequencing data via knockoff statistics

Zihuai He¹,², Linxi Liu³, Chen Wang⁴, Yann Le Guen⁵, Justin Lee⁶, Stephanie Gogarten⁷, Fred Lu⁸, Stephen Montgomery⁹,¹⁰, Hua Tang⁸,⁹, Edwin K Silverman¹¹, Michael H Cho¹¹, Michael Greicius⁵, Iuliana Ionita-Laza¹²

Affiliations

¹ Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA. zihuai@stanford.edu.
² Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA. zihuai@stanford.edu.
³ Department of Statistics, Columbia University, New York, NY, USA.
⁴ Department of Biostatistics, Columbia University, New York, NY, USA.
⁵ Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA.
⁶ Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA.
⁷ Department of Biostatistics, University of Washington, Seattle, WA, USA.
⁸ Department of Statistics, Stanford University, Stanford, CA, USA.
⁹ Department of Genetics, Stanford University, Stanford, CA, USA.
¹⁰ Department of Pathology, Stanford University, Stanford, CA, USA.
¹¹ Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
¹² Department of Biostatistics, Columbia University, New York, NY, USA. ii2135@cumc.columbia.edu.

PMID: 34035245
PMCID: PMC8149672
DOI: 10.1038/s41467-021-22889-4

Zihuai He et al. Nat Commun. 2021.

Abstract

The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Overview of KnockoffScreen.**
a Knockoff generation based on the original genotype matrix. Each row in the matrix corresponds to an individual and each column corresponds to a genetic variant. Each cell presents the genotype value/dosage. b Calculation of the importance score for each 1 bp, 1 kb, 5 kb, or 10 kb window. c Example of genome-wide screening results using conventional association testing (top) and KnockoffScreen (bottom).

**Fig. 2. Power and false discovery rate (FDR) simulation studies in a single region.**
The four panels show power and FDR based on 500 replicates for different types of traits (quantitative and dichotomous) and different types of variants (rare and common), with the target FDR varying from 0 to 0.2. The different colors indicate different knockoff generators. The different types of lines indicate different tests to define the importance score. Source data are provided as a Source Data file.

**Fig. 3. Genome-wide power and false discovery rate (FDR) simulation studies in the presence of multiple causal loci.**
a, c Empirical power for different types of traits (quantitative and dichotomous), defined as the average proportion of 200 kb causal loci being identified at target FDR 0.1. b, d Empirical FDR for different types of traits (quantitative and dichotomous) at different resolutions, defined as the proportion of significant windows (target FDR 0.1) ± 100/75/50 kb away from the causal windows. The empirical power and FDR are averaged over 100 replicates. Source data are provided as a Source Data file.
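
The empirical power and FDR in these simulations can be illustrated with a minimal sketch. It assumes one reading of the caption: a selected window counts as a false discovery when it lies more than a given distance from every causal window, and a causal locus counts as detected when at least one selected window lies within that distance. The function name, the use of a single base-pair position per window, and the toy coordinates are hypothetical and are not taken from the paper.

```python
def empirical_power_and_fdp(selected, causal, max_dist):
    """Empirical power and false discovery proportion for one replicate.

    selected, causal: window positions in base pairs on one chromosome.
    max_dist: distance (bp) within which a selection is attributed to a
    causal window (an assumed reading of the +/- 100/75/50 kb rule).
    """
    # Selected windows that are far from every causal window.
    false_hits = [s for s in selected
                  if all(abs(s - c) > max_dist for c in causal)]
    # Causal windows with at least one nearby selected window.
    detected = [c for c in causal
                if any(abs(s - c) <= max_dist for s in selected)]
    fdp = len(false_hits) / max(1, len(selected))
    power = len(detected) / max(1, len(causal))
    return power, fdp


# Toy replicate (hypothetical coordinates): two causal windows, three selections.
power, fdp = empirical_power_and_fdp(
    selected=[1_000_000, 1_050_000, 5_000_000],
    causal=[1_020_000, 9_000_000],
    max_dist=100_000,
)
print(power, fdp)
```

Averaging the two returned values over replicates would give the empirical power and FDR; the paper's own definition of power uses 200 kb causal loci, which this sketch simplifies.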

**Fig. 4. KnockoffScreen prioritizes causal variants/loci and distinguishes the signal due to rare variants from shadow effects of significant common variants nearby.**
a, b Results of the data analyses of the APOE ± 100 kb region from the ADSP data. Each dot represents a window. Windows selected by KnockoffScreen are highlighted in red. Windows selected by conventional association testing but not by KnockoffScreen are shown in gray. c–e Simulation results based on the APOE ± 100 kb region, comparing the conventional association testing and KnockoffScreen methods in terms of c frequency of selected variants/windows overlapping with the causal region; d maximum distance of selected variants/windows to the causal region; e number of false positives due to shadow effect. The target FDR is 0.1. The density plots are based on 500 replicates. Source data are provided as a Source Data file.

**Fig. 5. Empirical evaluation of KnockoffScreen in the presence of population stratification.**
a Principal component analysis of the ADSP data, which contains three ethnic groups: African American (AA), Non-Hispanic White (NHW), and Others (of which 98% are Caribbean Hispanic). Each dot represents an individual. b, c Simulation results for the FDR control in the presence of population stratification that mimics the ADSP data, comparing KnockoffScreen with conventional association testing. Each panel shows empirical FDR based on 500 replicates. KnockoffScreen 10PCs is a modified version of the KnockoffScreen method that includes adjustment for the top principal components while computing the association statistics (p-values). KnockoffScreen controls FDR at 0.10; Association Testing is based on the usual Bonferroni correction (0.05/number of tests), controlling FWER at 0.05. Source data are provided as a Source Data file.
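
For reference, the two error rates contrasted in this caption can be written out explicitly. The notation below ($m$ tests, of which $m_0$ are true nulls; $V$ false discoveries; $R$ total discoveries; level $\alpha$) is the standard textbook convention and is assumed here rather than taken from the paper.

$$
\mathrm{FWER} = \Pr(V \ge 1), \qquad \mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right].
$$

The Bonferroni rule rejects test $j$ when $p_j < \alpha/m$, and the union bound gives

$$
\Pr(V \ge 1) \le \sum_{j \,\in\, \text{nulls}} \Pr\left(p_j < \frac{\alpha}{m}\right) \le m_0 \cdot \frac{\alpha}{m} \le \alpha .
$$

As a worked example with a hypothetical $m = 10^{6}$ tested windows and $\alpha = 0.05$, the per-test threshold is $0.05 / 10^{6} = 5 \times 10^{-8}$. Controlling FDR at 0.10 instead bounds the expected proportion of false discoveries among the selected windows, which is a less stringent criterion than bounding the probability of any false discovery.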

**Fig. 6. KnockoffScreen application to the Alzheimer’s Disease Sequencing Project (ADSP) data to identify variants associated with Alzheimer’s disease.**
a Manhattan plot of p-values (truncated at $10^{-20}$ for clear visualization) from the conventional association testing with Bonferroni adjustment ($p < 0.05$/number of tested windows) for FWER control. b Manhattan plot of KnockoffScreen with target FDR at 0.1. c Heatmap that shows stratified p-values (truncated at $10^{-10}$ for clear visualization) of all loci passing the FDR = 0.1 threshold, and the corresponding Q-values that already incorporate correction for multiple testing. The loci are shown in descending order of the knockoff statistics. For each locus, the p-values of the top associated single variant and/or window are shown, indicating whether the signal comes from a single variant, a combined effect of common variants, or a combined effect of rare variants. The names of those genes previously implicated by GWAS studies are shown in bold (names were just used to label the region and may not represent the causative gene in the region). Source data are provided as a Source Data file.

**Fig. 7. KnockoffScreen application to the COPDGene study in TOPMed to identify variants associated with FEV₁ in Non-Hispanic White (NHW).**
a Manhattan plot of p-values from the conventional association testing with Bonferroni adjustment ($p < 0.05$/number of tested windows) for FWER control. b Manhattan plot of KnockoffScreen with target FDR at 0.1. c Heatmap that shows stratified p-values of all loci passing the FDR = 0.1 threshold, and the corresponding Q-values that already incorporate correction for multiple testing. The loci are shown in descending order of the knockoff statistics. For each locus, the p-values of the top associated single variant and/or window are shown, indicating whether the signal comes from a single variant, a combined effect of common variants, or a combined effect of rare variants. The names of those genes previously implicated by GWAS studies are shown in bold (names were just used to label the region and may not represent the causative gene in the region). Source data are provided as a Source Data file.

**Fig. 8. Scatter plot of genome-wide W statistic vs. −log10 (p-value).**
Each dot represents one variant/window. The dashed lines show the significance thresholds defined by Bonferroni correction (for p-values) and by false discovery rate (FDR; for W statistic). The p-values are from the conventional association testing described in the main text. Source data are provided as a Source Data file.
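
The two dashed thresholds in this figure can be illustrated with a minimal sketch. For the W statistic it uses the standard single-knockoff "knockoff+" rule of Barber and Candès, which is an assumption made for illustration: KnockoffScreen itself integrates multiple knockoffs, and its exact statistic and threshold are not given in this text. The function names and toy values are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    Returns float('inf') when no threshold reaches the target FDR q,
    in which case nothing is selected.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # stand-in for false discoveries
        n_pos = sum(1 for x in w if x >= t)   # candidate discoveries
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff controlling FWER at level alpha."""
    return alpha / n_tests


# Toy W statistics (hypothetical): ten strong positive values plus noise.
w_stats = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, -1, 2, -3]
tau = knockoff_plus_threshold(w_stats, q=0.1)
selected = [j for j, x in enumerate(w_stats) if x >= tau]
print(tau, selected)
print(bonferroni_threshold(0.05, 1000))
```

Large positive W values indicate that the original variant or window is more strongly associated than its knockoff copy, while negative values serve as an internal estimate of how many large positive values arise by chance.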

**Fig. 9. Simulation studies to evaluate the stability and reproducibility of different knockoff procedures.**
Different colors indicate different knockoff procedures: KnockoffScreen, single knockoff, and MK – Maximum (the multiple knockoff method based on the maximum statistic proposed by Gimenez and Zou). All three methods are based on the same knockoff generator proposed in this paper for a fair comparison. The stability (a, c) is quantified as the variation of $\tau_{\Phi_{kl}}$ across 100 replicates due to randomly sampling knockoffs for a given dataset (left and right panels). The reproducibility (b) is quantified as the frequency of a causal window being selected across 100 replicates. Source data are provided as a Source Data file.

Cited by

Enhancing credit scoring accuracy with a comprehensive evaluation of alternative data.
Hlongwane R, Ramaboa KKKM, Mongwe W. PLoS One. 2024 May 21;19(5):e0303566. doi: 10.1371/journal.pone.0303566. eCollection 2024. PMID: 38771812 Free PMC article.
Key variants via the Alzheimer's Disease Sequencing Project whole genome sequence data.
Wang Y, Sarnowski C, Lin H, Pitsillides AN, Heard-Costa NL, Choi SH, Wang D, Bis JC, Blue EE; Alzheimer's Disease Neuroimaging Initiative (ADNI); Boerwinkle E, De Jager PL, Fornage M, Wijsman EM, Seshadri S, Dupuis J, Peloso GM, DeStefano AL; Alzheimer's Disease Sequencing Project (ADSP). Alzheimers Dement. 2024 May;20(5):3290-3304. doi: 10.1002/alz.13705. Epub 2024 Mar 21. PMID: 38511601 Free PMC article.
Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression.
Chen Z, He Z, Chu BB, Gu J, Morrison T, Sabatti C, Candès E. ArXiv [Preprint]. 2024 Feb 20:arXiv:2402.12724v1. PMID: 38463500 Free PMC article. Preprint.
Knowledge domains and emerging trends of Genome-wide association studies in Alzheimer's disease: A bibliometric analysis and visualization study from 2002 to 2022.
Kong F, Wu T, Dai J, Cai J, Zhai Z, Zhu Z, Xu Y, Sun T. PLoS One. 2024 Jan 19;19(1):e0295008. doi: 10.1371/journal.pone.0295008. eCollection 2024. PMID: 38241287 Free PMC article.
Estimating gene-level false discovery probability improves eQTL statistical fine-mapping precision.
Wang QS, Edahiro R, Namkoong H, Hasegawa T, Shirai Y, Sonehara K; Japan COVID-19 Task Force; Kumanogoh A, Ishii M, Koike R, Kimura A, Imoto S, Miyano S, Ogawa S, Kanai T, Fukunaga K, Okada Y. NAR Genom Bioinform. 2023 Oct 30;5(4):lqad090. doi: 10.1093/nargab/lqad090. eCollection 2023 Dec. PMID: 37915762 Free PMC article.
