This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Mar 26:2024.03.22.24304565.

doi: 10.1101/2024.03.22.24304565.

Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease

Tanner D Jensen¹, Bohan Ni², Chloe M Reuter³, John E Gorzynski¹, Sarah Fazal⁴, Devon Bonner³, Rachel A Ungar¹, Pagé C Goddard¹, Archana Raja¹, Euan A Ashley¹, Jonathan A Bernstein¹, Stephan Zuchner¹; Undiagnosed Diseases Network; Michael D Greicius¹, Stephen B Montgomery¹, Michael C Schatz², Matthew T Wheeler⁵, Alexis Battle²

Affiliations

¹ Stanford University.
² Johns Hopkins University.
³ Stanford Health Care.
⁴ University of Miami.
⁵ Stanford Center for Inherited Cardiovascular Disease.

PMID: 38585781
PMCID: PMC10996727
DOI: 10.1101/2024.03.22.24304565

Tanner D Jensen et al. medRxiv. 2024.

Abstract

Rare structural variants (SVs) - insertions, deletions, and complex rearrangements - can cause Mendelian disease, yet they remain difficult to accurately detect and interpret. We sequenced and analyzed Oxford Nanopore long-read genomes of 68 individuals from the Undiagnosed Diseases Network (UDN) with no previously identified diagnostic mutations from short-read sequencing. Using our optimized SV detection pipelines and 571 control long-read genomes, we detected 716 long-read rare (MAF < 0.01) SV alleles per genome on average, achieving a 2.4x increase over short reads. To characterize the functional effects of rare SVs, we assessed their relationship with gene expression from blood or fibroblasts from the same individuals, and found that rare SVs overlapping enhancers were enriched (LOR = 0.46) near expression outliers. We also evaluated tandem repeat expansions (TREs) and found 14 rare TREs per genome; notably these TREs were also enriched near overexpression outliers. To prioritize candidate functional SVs, we developed Watershed-SV, a probabilistic model that integrates expression data with SV-specific genomic annotations, which significantly outperforms baseline models that don't incorporate expression data. Watershed-SV identified a median of eight high-confidence functional SVs per UDN genome. Notably, this included compound heterozygous deletions in FAM177A1 shared by two siblings, which were likely causal for a rare neurodevelopmental disorder. Our observations demonstrate the promise of integrating long-read sequencing with gene expression towards improving the prioritization of functional SVs and TREs in rare disease patients.

Conflict of interest statement

SBM is an advisor to BioMarin, Myome and Tenaya Therapeutics. AB is a co-founder of CellCipher, Inc, is a shareholder in Alphabet, Inc, and has consulted for Third Rock Ventures, LLC. EAA is the founder of Personalis, Deepcell, Svexa, RCD Co, and Parameter Health; an advisor for SequenceBio, Foresite Labs, and PacBio; a non-executive director at AstraZeneca; holds stock in Oxford Nanopore, Pacific Biosciences, and AstraZeneca; and offers collaborative support in kind to Illumina, Pacific Biosciences, Oxford Nanopore

Figures

**Figure 1. Undiagnosed patient cohort description and pipeline overview.**
**A Cohort Description**: Patients were recruited from the Undiagnosed Diseases Network for a long-read sequencing (LRS) study. These included 57 affected individuals and 11 unaffected family members from a wide range of primary symptom categories, including neurology, musculoskeletal, and cardiology. Patients had previous short-read genetic testing with Illumina that was negative or inconclusive. **B Long-read Pipeline Overview**: Individuals were sequenced on R9.4 flowcells on the ONT PromethION. Consensus structural variants were called by merging SVs across individual callers and keeping those that showed multi-algorithm support. A population merge of the UDN genomes together with the Stanford ADRC population reference of 579 nanopore genomes allowed ascertainment of robust allele frequencies for structural variants. Rare structural variants were filtered and intersected with overlapping genome annotations for input into Watershed. Vamos was used on a catalog of polymorphic tandem repeats to genotype tandem repeat copy numbers. A mean-neighbor-distance-based outlier calling method was used to define extreme repeat expansions. **C RNA-sequencing expression outlier pipeline**: Transcriptome data from the UDN were processed by quantifying expression, combining with tissue-matched controls from GTEx, normalizing for library size and composition bias, and correcting for batch effects and hidden factors. Expression outliers from the normalized data were input into Watershed. D Watershed-SV integrates signals from rare SVs and overlapping genome annotations to predict variants with large functional effects. High-scoring Watershed variants are prioritized and curated per patient for disease relevance.
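The "multi-algorithm support" step in panel B can be illustrated with a small sketch. The caption does not state how calls from different callers are matched, so the matching rule below (same chromosome and SV type, start positions within 500 bp, length ratio of at least 0.7) and the names `SVCall`, `same_event`, and `consensus_calls` are assumptions made for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the SV caller that produced the call
    chrom: str
    pos: int      # start coordinate
    svtype: str   # e.g. "DEL" or "INS"
    length: int   # SV length in bp


def same_event(a, b, max_dist=500, min_len_ratio=0.7):
    """Hypothetical matching rule: same chromosome and type,
    nearby start positions, and similar lengths."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    shorter, longer = sorted((a.length, b.length))
    return longer > 0 and shorter / longer >= min_len_ratio


def consensus_calls(calls, min_callers=2):
    """Greedily cluster calls, then keep one representative per cluster
    that is supported by at least `min_callers` distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.pos)):
        for cluster in clusters:
            if same_event(cluster[0], call):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c[0] for c in clusters if len({x.caller for x in c}) >= min_callers]
```

With two callers reporting a roughly 300 bp deletion at nearly the same position and a third caller reporting an unrelated insertion, only the deletion would be retained. Dedicated merging tools handle breakpoint uncertainty and sequence similarity far more carefully than this greedy rule.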

**Figure 2. Long-read sequencing detects rare structural variants and extreme tandem repeat expansions.**
A Length distribution of deletions and insertions detected by each technology on a log-log axis. For long reads, SVs were called with a consensus SV calling pipeline including SVIM, cuteSV, and Sniffles2; for short reads, Manta SV calls were genotyped with Paragraph. The dashed line represents 50 bp, the threshold for calling an indel an SV. B Mean tandem repeat copy numbers estimated from the UDN genomes, stratified by repeat motif length. Short tandem repeats (STRs) have repeat motifs of 2–6 bp. Variable number tandem repeats (VNTRs) have repeat motifs greater than or equal to 7 bp. Vamos was used to genotype tandem repeat copy number in long reads and ExpansionHunter was used in short reads. Each tool used a different catalog of tandem repeat loci to define TRs. Counts of TRs by repeat motif length bin in each tool's respective catalog are also plotted. C Allele frequency distribution of long-read-discovered SVs from the Jasmine SV merge with ADRC genomes. ADRC provided a reference sample of 600 nanopore genomes to allow robust estimation of minor allele frequencies. D Count of rare SVs (MAF < 0.01) detected per individual, stratified by SV type and technology. Short-read SVs were annotated with allele frequencies using SVAFotate and lookup in gnomAD, CCDG, and 1000G. E Count of extreme tandem repeat expansions (TREs) detected per individual. Extreme TRE outliers in each technology were called by jointly estimating the repeat copy number distribution of long-read Vamos calls with the ADRC and of short-read ExpansionHunter calls with 1000G, and then calculating for each allele its average distance from its k nearest neighbors. Extreme TREs were defined as alleles with a standardized mean neighbor distance greater than 2, with k = 5 for long read and k = 25 for short read.
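The outlier rule in panel E (mean distance to the k nearest neighbors, standardized, with a cutoff of 2) can be sketched for a single tandem repeat locus as follows. The caption does not say how the distances are standardized or how expansions are separated from contractions, so the z-score standardization and the "above the median copy number" condition are assumptions, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev


def mean_knn_distance(values, k):
    """Mean absolute distance from each value to its k nearest neighbours
    among the other values at the same locus."""
    out = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        out.append(mean(dists[:k]))
    return out


def extreme_expansions(copy_numbers, k=5, threshold=2.0):
    """Flag alleles whose standardized mean neighbour distance exceeds
    `threshold` and whose copy number lies above the median."""
    d = mean_knn_distance(copy_numbers, k)
    mu, sd = mean(d), pstdev(d)
    if sd == 0:
        # Every allele is equally isolated, so nothing stands out.
        return [False] * len(d)
    mid = median(copy_numbers)
    # Assumption: z-score standardization across alleles at this locus,
    # and only alleles longer than the median count as expansions.
    return [(di - mu) / sd > threshold and cn > mid
            for di, cn in zip(d, copy_numbers)]
```

For a locus where most alleles carry 10 or 11 copies and one allele carries 60, only the 60-copy allele is flagged. A larger k, such as the 25 used for short reads in the caption, makes the distance estimate less sensitive to a handful of similarly expanded alleles.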

**Figure 3. Rare long-read-discovered SVs are strongly enriched proximal to gene expression outliers**
A Enrichment of rare structural variants, stratified by type, within 10kb of an expression outlier gene given the specified absolute Z-score threshold. Estimates of the log odds ratio are plotted with error bars representing standard errors of the estimate. B Directional enrichment for rare SVs within 10kb of either over (Z > 4) or under (Z < −4) expression outliers. The model for tandem repeat expansions (TREs) near underexpression outliers did not converge because there were too few examples of rare TREs near underexpression outliers, so it was not plotted. C Enrichment of rare SVs, across all SV types, within 100kb of expression outliers, stratified by genome and variant annotation categories. Gene body position displays enrichment of VEP-annotated categories for SV location relative to the gene body of the expressed gene. If an SV overlaps multiple categories, it is assigned to the one with highest priority given the following ordering: CDS, 5’UTR, 3’UTR, intron, upstream noncoding, downstream noncoding. SV length and CADD-SV deleteriousness display enrichment of rare SVs with length and CADD-SV score, respectively, above the specified threshold. VEP impact displays enrichment of rare SVs with the given VEP impact category, where HIGH represents predicted loss-of-function variants. Finally, we display enrichment of SVs that overlap with noncoding regulatory annotations, including whether an SV overlaps an ABC regulatory element linked to the expressed gene, a conserved transcription factor binding site (TFBS), a high density of ChIP-seq peaks defining conserved regulatory modules (CRM) from ReMap, a TAD boundary detected in multiple cell types, highly constrained LINSIGHT SNVs, or a region highly conserved by phastCons. We also display enrichments for SVs that overlapped any one of these annotations (putative regulatory SVs) and for SVs that do not overlap with any of these annotations (putative non-regulatory).
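To make the plotted quantity concrete, a log odds ratio and its standard error can be computed from a 2x2 table of outlier status against presence of a nearby rare SV. The caption's reference to a model that did not converge suggests the authors estimated enrichment with a regression model, so the closed-form Woolf estimate below is a simpler stand-in rather than their method. The natural logarithm, the 0.5 continuity correction for empty cells, and the function name are assumptions.

```python
import math


def log_odds_ratio(outlier_with_sv, outlier_without_sv,
                   nonoutlier_with_sv, nonoutlier_without_sv):
    """Log odds ratio and Woolf standard error from 2x2 counts."""
    cells = [outlier_with_sv, outlier_without_sv,
             nonoutlier_with_sv, nonoutlier_without_sv]
    if any(c == 0 for c in cells):
        # Assumption: Haldane-Anscombe correction so the estimate stays finite.
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    lor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se


# Made-up counts, for illustration only.
lor, se = log_odds_ratio(20, 80, 10, 90)
print(f"log odds ratio = {lor:.3f}, standard error = {se:.3f}")
```

A positive value means rare SVs are found near outlier genes more often than near non-outlier genes. An empty cell, as with rare TREs near underexpression outliers, is the situation in which a regression-based estimate would fail to converge and the correction above would be needed.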

**Figure 4. Watershed-SV improves prioritization of rare SVs in healthy and muscular dystrophy cohorts.**
A Precision-recall curves (PRC) of the benchmark using held-out N2 pairs; we ran multi-tissue Watershed-SV using both 10kb (solid) and 100kb (dashed) distance limits, as well as a WGS-only model with the same setup. B Top positive genomic annotation effect sizes (β) for 7 major categories of the 10kb multi-tissue Watershed-SV model. C Using z-score thresholds of −3 and 3, we stratified the posterior probabilities predicted by the 100kb multi-tissue Watershed-SV model on the CMG muscular disorder dataset by under-, over-, and non-outliers (columns), and then by coding vs noncoding variants (rows); each dot represents a gene-SV pair. D Top positive genomic annotation effect sizes for the 100kb multi-tissue Watershed-SV model. The 7 annotation categories are grouped into region-specific (TSS/upstream flank, gene body, TES/downstream flank) and region-agnostic features. Region-specific features are separately aggregated for each SV, then collapsed to each gene by region.
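The stratification in panel C amounts to binning each gene-SV pair by its expression z-score and by whether the variant is coding. A minimal sketch follows, assuming each pair is a dictionary with hypothetical keys `z`, `coding`, and `posterior`. Whether the thresholds are inclusive is not stated in the caption, so the inclusive comparison here is an assumption.

```python
def outlier_class(z, threshold=3.0):
    """Classify an expression z-score as under-, over-, or non-outlier."""
    if z <= -threshold:
        return "under"
    if z >= threshold:
        return "over"
    return "non-outlier"


def stratify(pairs, threshold=3.0):
    """Group posterior probabilities by (outlier class, coding status)."""
    groups = {}
    for pair in pairs:
        key = (outlier_class(pair["z"], threshold),
               "coding" if pair["coding"] else "noncoding")
        groups.setdefault(key, []).append(pair["posterior"])
    return groups
```

Each resulting group corresponds to one cell of the column-by-row grid described in the caption, which makes it easy to check whether high posteriors concentrate among the under- and over-outliers.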

**Figure 5. Watershed-SV prioritizes symptom-relevant functional rare SVs from UDN LRS dataset.**
A Swarmplot of the number of gene-SV pairs prioritized per individual in the UDN LRS dataset under different sets of combined filters. There are 4 filter categories: WGS-only filters, WGS + HPO filters, WGS + RNA filters, and WGS + RNA + HPO filters, in increasing order of stringency as more types of filters are jointly applied; the red dot represents the mean number of gene-SV pairs across individuals, and the red horizontal line represents the standard deviation; the x-axis is in log2 scale; the bar plot on the right shows the number of samples with significant prioritizations. B Upset plot depicting the number of gene-SV pairs prioritized by Watershed-SV (posterior > 0.6) and CADD-SV (score > 10), and whether the SV is uniquely identified using LRS. **C and E** Case example 1, rare TREs shared by both siblings, and case example 2, rare compound heterozygous deletions in siblings. The lollipop plot shows which sets of filters include the candidate diagnostic gene-SV pair (triangle) and which do not (circle); the height of the lollipop represents the number of gene-SV pairs prioritized, in log2 scale. D Panels depict the TR copy numbers of the siblings and of the unaffected parent with the less-expanded allele. The TRE locus is in the 5’ UTR of FAM193B. Both Watershed-SV and CADD-SV can prioritize this variant, but the WGS-only model cannot. Both siblings have extremely high overexpression z-scores. F Panels depict the compound heterozygous deletions phased onto both alleles of FAM177A1, causing LOF of the gene and thereby underexpression outliers. Only Watershed-SV succeeded at prioritizing both variants.
