Meta-Analysis
Am J Hum Genet. 2021 Dec 2;108(12):2336-2353.
doi: 10.1016/j.ajhg.2021.10.009. Epub 2021 Nov 11.

Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics

Zihuai He et al. Am J Hum Genet.

Abstract

Knockoff-based methods have become increasingly popular due to their enhanced power for locus discovery and their ability to prioritize putative causal variants in a genome-wide analysis. However, because of the substantial computational cost for generating knockoffs, existing knockoff approaches cannot analyze millions of rare genetic variants in biobank-scale whole-genome sequencing and whole-genome imputed datasets. We propose a scalable knockoff-based method for the analysis of common and rare variants across the genome, KnockoffScreen-AL, that is applicable to biobank-scale studies with hundreds of thousands of samples and millions of genetic variants. The application of KnockoffScreen-AL to the analysis of Alzheimer disease (AD) in 388,051 WG-imputed samples from the UK Biobank resulted in 31 significant loci, including 14 loci that are missed by conventional association tests on these data. We perform replication studies in an independent meta-analysis of clinically diagnosed AD with 94,437 samples, and additionally leverage single-cell RNA-sequencing data with 143,793 single-nucleus transcriptomes from 17 control subjects and AD-affected individuals, and proteomics data from 735 control subjects and affected individuals with AD and related disorders to validate the genes at these significant loci. These multi-omics analyses show that 79.1% of the proximal genes at these loci and 76.2% of the genes at loci identified only by KnockoffScreen-AL exhibit at least suggestive signal (p < 0.05) in the scRNA-seq or proteomics analyses. We highlight a potentially causal gene in AD progression, EGFR, that shows significant differences in expression and protein levels between AD-affected individuals and healthy control subjects.

Keywords: Alzheimer disease; GWAS; knockoff statistics; omics; sequencing.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1
Overview of KnockoffScreen-AL. (A) The KnockoffScreen-AL method. (B) The application of KnockoffScreen-AL to UK Biobank data. (C) Venn diagrams showing the number of identified loci that overlap with known AD loci or are replicated (p < 0.05). Common, common-variant loci; rare, rare-variant loci; overlap with known AD loci, overlap with Jansen et al. and Kunkle et al.; replication, replication p value < 0.05 based on summary statistics from Kunkle et al. (D) Venn diagrams showing the number of implicated genes that are significant (p < 0.05) in scRNA-seq or proteomics analysis. KS-AL only, the additional genes identified by KnockoffScreen-AL but missed by conventional association tests; ProteomicsAging, p value < 0.05 in the proteomics analysis of age effect; ProteomicsADvsHC, p value < 0.05 in the proteomics analysis comparing Alzheimer disease-affected individuals to healthy control subjects; scRNA-seq, p value < 0.05 in the scRNA-seq analysis for at least one cell type.
Figure 2
Computing time, peak random-access memory (RAM) use, power, and FDR of different knockoff generators. (A and B) The computing time and RAM were evaluated based on 2,000 variants, varying the sample size from 1,000 to 500,000. Naive SCIT, sequential conditional independent tuples (SCIT) with the “exact” linear model; BM, memory-efficient matrix operation. The shrinkage algorithmic leveraging BM method corresponds to the proposed KnockoffScreen-AL. The computing time for naive SCIT is truncated at sample size 100,000 because it cannot be applied to larger sample sizes. We also benchmark the computing time for phasing 10,000 samples via fastPhase with number of states K = 12. (C and D) Power/FDR comparison between KnockoffScreen-AL and the naive SCIT. (E and F) Power/FDR comparison between KnockoffScreen-AL (SCIT multiple knockoffs + ACAT-O) and other existing knockoff generators and feature importance score calculations. The different colors indicate different knockoff generators. The different line types indicate different tests used to define the importance score.
Figure 3
Genome-wide analysis of Alzheimer disease in UK Biobank. (A) The Manhattan plot of p values (truncated at 10^-50 for clear visualization) from the conventional common-variant and rare-variant association tests with conventional GWAS threshold (p < 5 × 10^-8) for FWER control. (B) The Manhattan plot of KnockoffScreen-AL with target FDR at 0.10. The names of those loci previously reported by GWASs are shown in purple; names of discoveries not included in Jansen et al. and Kunkle et al. are shown in red (FDR = 0.05) and blue (FDR = 0.10).
Figure 4
Single-cell RNA-seq data (n = 143,793) analysis of the 43 proximal genes. For each gene, we present the differentially expressed genes (DEG) analysis, comparing Alzheimer disease-affected individuals (AD) with healthy control subjects. (A) All 43 proximal genes. (B) The additional genes identified by KnockoffScreen-AL but missed by conventional association tests. Each dot represents a gene. Colors represent different cell types. The black dashed lines mark the p value cutoff at 0.05; the gray dashed lines mark the p value cutoff at 0.05/43 (number of candidate genes). For visualization purposes, −log10(p) was capped at 15 and abs(log2(fold change)) was capped at 1.0. Positive log2 fold change corresponds to higher expression level in AD.
Figure 5
Proteomics data analysis of genes at the 31 significant loci. In addition to the 43 proximal genes, we also include genes within ±200 kb of each significant locus that can be matched with a proteomics profile. (A and D) We present the differential abundance analysis comparing Alzheimer disease (AD)-affected individuals with healthy control subjects (HC) (A) and the analysis of the age effect (D). Each dot represents a gene. Different colors represent different types of significance. NS, not significant; log2FC, |log2 fold change| ≥ 0.05; p value, p value ≤ 0.05; p value and log2FC, |log2 fold change| ≥ 0.05 and p value ≤ 0.05. The dashed gray lines correspond to the Bonferroni correction p value threshold 0.05/78 = 0.00064. (B and C) Differential abundance analysis of EGFR/TREM2. (E and F) Age effect analysis of EGFR/TREM2. MCI, mild cognitive impairment; LBD, Lewy body dementia.
Figure 6
Colocalization analysis of EGFR. (A) Colocalization analysis of EGFR and nearby genes with the brain eQTLs meta-analysis and GTEx brain tissue eQTLs. (B) Colocalization analysis of EGFR with the brain eQTLs meta-analysis. The lead variant rs75061358 and its LD-linked variant rs6979446 are highlighted (red and purple, respectively).
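
The cutoffs and plotting caps quoted in the captions of Figures 4 and 5 are simple arithmetic, and a minimal Python sketch may make them easier to check. It is an illustration only, not the authors' code: the function names are hypothetical, and preserving the sign of the fold change when capping its absolute value is an assumption.

```python
import math


def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Bonferroni-corrected p value cutoff, e.g. 0.05/43 or 0.05/78."""
    return alpha / n_tests


def capped_volcano_point(p: float, log2_fc: float,
                         max_neg_log10_p: float = 15.0,
                         max_abs_log2_fc: float = 1.0) -> tuple[float, float]:
    """Return (log2 fold change, -log10 p) with the caps described for Figure 4.

    Assumption: the fold change keeps its sign and only its magnitude is capped.
    """
    # A p value of zero (numerical underflow) is placed directly at the cap.
    neg_log10_p = max_neg_log10_p if p <= 0 else min(-math.log10(p), max_neg_log10_p)
    capped_fc = max(-max_abs_log2_fc, min(log2_fc, max_abs_log2_fc))
    return capped_fc, neg_log10_p


def significance_label(p: float, log2_fc: float,
                       p_cut: float = 0.05, fc_cut: float = 0.05) -> str:
    """Assign the four categories listed in the Figure 5 caption."""
    passes_p = p <= p_cut
    passes_fc = abs(log2_fc) >= fc_cut
    if passes_p and passes_fc:
        return "p value and log2FC"
    if passes_p:
        return "p value"
    if passes_fc:
        return "log2FC"
    return "NS"
```

For example, `bonferroni_cutoff(0.05, 78)` reproduces the 0.05/78 threshold drawn as the dashed gray line in Figure 5, and `bonferroni_cutoff(0.05, 43)` gives the stricter of the two cutoffs in Figure 4.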
