Abstract
Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
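The Mendelian concordance figure quoted above, with its allowance of a single repeat unit, can be illustrated with a minimal sketch. This is not TRGT's implementation: the function names, the representation of each genotype as two allele lengths in base pairs, and the example values are all assumptions made for illustration, and sex chromosomes and missing genotypes are ignored.

```python
from itertools import permutations


def is_mendelian_consistent(child, father, mother, motif_len, tolerance_units=1):
    """Return True if the child's two alleles can be explained by one allele
    from each parent, allowing a length difference of up to
    `tolerance_units` repeat units (hypothetical helper, not part of TRGT).

    child, father, mother: tuples of two allele lengths in base pairs.
    motif_len: length of the repeat unit in base pairs.
    """
    max_diff = tolerance_units * motif_len
    # Try both assignments of the child's alleles to the two parents.
    for from_father, from_mother in permutations(child, 2):
        father_ok = any(abs(from_father - a) <= max_diff for a in father)
        mother_ok = any(abs(from_mother - a) <= max_diff for a in mother)
        if father_ok and mother_ok:
            return True
    return False


def mendelian_concordance(loci, tolerance_units=1):
    """Fraction of loci whose trio genotypes are Mendelian consistent."""
    consistent = sum(
        is_mendelian_consistent(
            locus["child"], locus["father"], locus["mother"],
            locus["motif_len"], tolerance_units,
        )
        for locus in loci
    )
    return consistent / len(loci)


# Illustrative (made-up) loci with a 3 bp motif.
example_loci = [
    {"child": (30, 45), "father": (30, 60), "mother": (45, 45), "motif_len": 3},
    {"child": (30, 48), "father": (30, 60), "mother": (45, 45), "motif_len": 3},
    {"child": (30, 90), "father": (30, 60), "mother": (45, 45), "motif_len": 3},
]

if __name__ == "__main__":
    print(mendelian_concordance(example_loci))
```

In this sketch the second locus counts as consistent only because the tolerance is one repeat unit; setting `tolerance_units=0` would require exact allele length matches.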
Data availability
PacBio Revio sequencing data for the HG002, HG003 and HG004 samples have been deposited in the Sequence Read Archive (SRA) (ref. 82). Version 0.7 of the HG002 assembly from the Telomere-to-Telomere Consortium was downloaded from GitHub (refs. 40,83). The data created as part of Genomic Answers for Kids are available through NIH/NCBI dbGaP under accession number phs002206 (ref. 84). Human Pangenome Reference Consortium data are available at the SRA under BioProject ID PRJNA850430 (ref. 85) and from the AWS Registry of Open Data (ref. 86). The short-read data for HG002, HG003 and HG004 are available from the 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7 within the AWS Registry of Open Data (ref. 87). TRGT repeat catalogs and TRGTdb for 100 HPRC samples have been deposited in a dedicated Zenodo repository (ref. 88).
Code availability
The source code of TRGT, TRVZ and TRGTdb is available on GitHub (ref. 64).
References
English, A. et al. Benchmarking of small and large variants across tandem repeats. Preprint at bioRxiv https://doi.org/10.1101/2023.10.29.564632 (2023).
Caron, N. S., Wright, G. E. B. & Hayden, M. R. Huntington disease. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 1998).
Siddique, N. & Siddique, T. Amyotrophic lateral sclerosis overview. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 2001).
Hunter, J. E., Berry-Kravis, E., Hipp, H. & Todd, P. K. FMR1 disorders. In GeneReviews® (eds. Adam, M. P. et al.) (Univ. Washington, 1998).
Gymrek, M. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 48, 22–29 (2016).
Erwin, G. S. et al. Recurrent repeat expansions in human cancer genomes. Nature 613, 96–102 (2023).
Li, K., Luo, H., Huang, L., Luo, H. & Zhu, X. Microsatellite instability: a review of what the oncologist should know. Cancer Cell Int. 20, 16 (2020).
Trost, B. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature 586, 80–86 (2020).
Mojarad, B. A. et al. Genome-wide tandem repeat expansions contribute to schizophrenia risk. Mol. Psychiatry 27, 3692–3698 (2022).
Morales, F. et al. Somatic instability of the expanded CTG triplet repeat in myotonic dystrophy type 1 is a heritable quantitative trait and modifier of disease severity. Hum. Mol. Genet. 21, 3558–3567 (2012).
Morales, F. et al. Longitudinal increases in somatic mosaicism of the expanded CTG repeat in myotonic dystrophy type 1 are associated with variation in age-at-onset. Hum. Mol. Genet. 29, 2496–2507 (2020).
Overend, G. et al. Allele length of the DMPK CTG repeat is a predictor of progressive myotonic dystrophy type 1 phenotypes. Hum. Mol. Genet. 28, 2245–2254 (2019).
Press, M. O., Carlson, K. D. & Queitsch, C. The overdue promise of short tandem repeat variation for heritability. Trends Genet. 30, 504–512 (2014).
Payseur, B. A., Place, M. & Weber, J. L. Linkage disequilibrium between STRPs and SNPs across the human genome. Am. J. Hum. Genet. 82, 1039–1050 (2008).
Zhou, Y. et al. Robust fragile X (CGG)n genotype classification using a methylation specific triple PCR assay. J. Med. Genet. 41, e45 (2004).
Tarleton, J. Detection of FMR1 trinucleotide repeat expansion mutations using Southern blot and PCR methodologies. In Neurogenetics: Methods and Protocols (ed. Potter, N. T.) 29–39 (Springer, 2003).
Rajan-Babu, I. S., Law, H. Y., Yoon, C. S., Lee, C. G. & Chong, S. S. Simplified strategy for rapid first-line screening of fragile X syndrome: closed-tube triplet-primed PCR and amplicon melt peak analysis. Expert Rev. Mol. Med. 17, e7 (2015).
Gymrek, M., Golan, D., Rosset, S. & Erlich, Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 22, 54–62 (2012).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods 14, 590–592 (2017).
Dolzhenko, E. et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 27, 1895–1903 (2017).
Dashnow, H. et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 19, 121 (2018).
Mousavi, N., Shleizer-Burko, S., Yanicky, R. & Gymrek, M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47, e90 (2019).
Dolzhenko, E. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35, 4754–4756 (2019).
Dolzhenko, E. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 21, 102 (2020).
Dashnow, H. et al. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. Genome Biol. 23, 257 (2022).
Hannan, A. J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 19, 286–298 (2018).
Ibañez, K. et al. Whole genome sequencing for the diagnosis of neurological repeat expansion disorders in the UK: a retrospective diagnostic accuracy and prospective clinical validation study. Lancet Neurol. 21, 234–245 (2022).
Giesselmann, P. et al. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat. Biotechnol. 37, 1478–1481 (2019).
Mitsuhashi, S. et al. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol. 20, 58 (2019).
Chiu, R., Rajan-Babu, I. S., Friedman, J. M. & Birol, I. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol. 22, 224 (2021).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
Oostra, B. A. & Willemsen, R. FMR1: a gene with three faces. Biochim. Biophys. Acta 1790, 467–477 (2009).
Roy, S. et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. J. Mol. Diagn. 20, 4–27 (2018).
Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. & Bafna, V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 28, 1709–1719 (2018).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
English, A. Project Adotto Tandem-Repeat Regions and Annotations. Zenodo https://doi.org/10.5281/zenodo.7013709 (2022).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).
Tsai, Y. C. et al. Amplification-free, CRISPR–Cas9 targeted enrichment and SMRT sequencing of repeat-expansion disease causative genomic regions. Preprint at bioRxiv https://doi.org/10.1101/203919 (2017).
Grosso, V. et al. Characterization of FMR1 repeat expansion and intragenic variants by indirect sequence capture. Front. Genet. 12, 743230 (2021).
Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics 37, 731–733 (2020).
Ziaei Jam, H. et al. A deep population reference panel of tandem repeat variation. Nat. Commun. 14, 6711 (2023).
Dreos, R., Ambrosini, G., Cavin Périer, R. & Bucher, P. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 41, D157–D164 (2013).
Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).
Vavouri, T. & Lehner, B. Human genes with CpG island promoters have a distinct transcription-associated chromatin organization. Genome Biol. 13, R110 (2012).
Takai, D. & Jones, P. A. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl Acad. Sci. USA 99, 3740–3745 (2002).
Rafehi, H. et al. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am. J. Hum. Genet. 105, 151–165 (2019).
Cortese, A. et al. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat. Genet. 51, 649–658 (2019).
Akçimen, F. et al. Investigation of the RFC1 repeat expansion in a Canadian and a Brazilian ataxia cohort: identification of novel conformations. Front. Genet. 10, 1219 (2019).
Fan, Y. et al. No biallelic intronic AAGGG repeat expansion in RFC1 was found in patients with late-onset ataxia and MSA. Parkinsonism Relat. Disord. 73, 1–2 (2020).
Hagerman, R. J. et al. Fragile X syndrome. Nat. Rev. Dis. Primers 3, 17065 (2017).
Yrigollen, C. M. et al. AGG interruptions and maternal age affect FMR1 CGG repeat allele stability during transmission. J. Neurodev. Disord. 6, 24 (2014).
Huang, W. et al. Distribution of fragile X mental retardation 1 CGG repeat and flanking haplotypes in a large Chinese population. Mol. Genet. Genomic Med. 3, 172–181 (2015).
Depienne, C. & Mandel, J. L. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am. J. Hum. Genet. 108, 764–785 (2021).
Ashley, E. A. Towards precision medicine. Nat. Rev. Genet. 17, 507–522 (2016).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).
TRGTdb tutorial. https://github.com/ACEnglish/trgt/blob/main/tdb_tutorial.md
Stovner, E. B. & Sætrom, P. PyRanges: efficient comparison of genomic intervals in Python. Bioinformatics 36, 918–919 (2020).
ACEnglish/trgt. https://github.com/ACEnglish/trgt/tree/main/notebooks
Dolzhenko, E. et al. TRGT: tandem repeat genotyper. Github https://github.com/PacificBiosciences/trgt/ (2023).
Index of /ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity. https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity/
Table Browser. https://genome.ucsc.edu/cgi-bin/hgTables
Repeats. http://useast.ensembl.org/info/genome/genebuild/assembly_repeats.html
Bakhtiari, M., Park, J., Javadzadeh, S., Homer, N. & De Coster, W. A tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data. Github https://github.com/mehrdadbakhtiari/adVNTR (2023).
Qiu, Y. J., Deshpande, V., Avdeyev, P., Dolzhenko, E. & Eberle, M. A. Illumina/RepeatCatalogs. Github https://github.com/Illumina/RepeatCatalogs (2023).
Lucas, J., Li, H. & Jeltje human-pangenomics/HPP_Year1_Assemblies. Assemblies from HPP Year 1 production. Github https://github.com/human-pangenomics/HPP_Year1_Assemblies (2023).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Cohen, A. S. A. et al. Genomic answers for children: dynamic analyses of >1000 pediatric rare disease genomes. Genet. Med. 24, 1336–1348 (2022).
Cheung, W. A. et al. Direct haplotype-resolved 5-base HiFi sequencing for genome-wide profiling of hypermethylation outliers in a rare disease cohort. Nat. Commun. 14, 3090 (2023).
Pedersen, B. S. et al. Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. Genome Med. 12, 62 (2020).
Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132 (2009).
Töpfer, A. et al. PacificBiosciences/pbmm2. A minimap2 frontend for PacBio native data formats. Github https://github.com/PacificBiosciences/pbmm2 (2023).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Granger, B. E. & Perez, F. Jupyter: thinking and storytelling with code and data. Comput. Sci. Eng. 23, 7–14 (2021).
pandas-dev/pandas: Pandas. Zenodo https://doi.org/10.5281/zenodo.10045529 (2023).
Homo sapiens (human): WGS of GIAB HG002-4 trio with PacBio HiFi. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1028149 (2023).
Hansen, N. F., Phillippy, A., Koren, S. & Walenz, B. Telomere-to-telomere consortium HG002 ‘Q100’ project. Github https://github.com/marbl/hg002 (2023).
Genomic Answers for Kids (GA4K). dbGaP. https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002206.v4.p1
Homo sapiens: Human Pangenome Reference Consortium (HPRC). https://www.ncbi.nlm.nih.gov/bioproject/730823 (2021).
Human PanGenomics Project. https://registry.opendata.aws/hpgp-data/
1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7. https://registry.opendata.aws/ilmn-dragen-1kgp/
Dolzhenko, E. & English, A. Repeat catalogs for TRGT. Zenodo https://doi.org/10.5281/zenodo.8329210 (2023).
Acknowledgements
We would like to thank M. Gymrek, I. Deveson and the anonymous reviewers for helping us to substantially improve the manuscript and TRGT. We are grateful to the Telomere-to-Telomere Consortium, the Human Pangenome Reference Consortium and the Genome in a Bottle Consortium for releasing datasets essential for this study. We would also like to acknowledge many TRGT users who provided valuable feedback that helped us to substantially improve the tool. We thank generous donors to the Genomic Answers for Kids program at Children’s Mercy Kansas City. A.E. was supported by grant HHSN268201800002I. H.D. was supported by grants K99HG012796 and 5T32HG008962-07. P.J. was supported by grants NS111602, HD104458 and HD104463. D.L.N. was supported by grants HD104463, NS051630 and HD103555. S.Z. was supported by grant 2R01NS072248. T.P. was supported by grant UL1TR002366. A.R.Q. was supported by grant R01HG010757. F.J.S. was supported by grants 1U01HG011758-01, 3OT2OD002751 and 1UG3NS132105-01.
Author information
Contributions
E.D. and M.A.E. devised and implemented the initial versions of TRGT and TRVZ. A.E. and F.J.S. implemented TRGTdb. H.D. performed analysis of samples with known expansions, in collaboration with W.A.C., C.B., E.F. and T.P. H.D., W.J.R., Z.K. and A.W. guided the development of TRGT. G.D.S.B., E.D., H.D. and M.C.D. performed benchmarking analyses. T.M. and G.D.S.B. contributed major improvements to the TRGT source code. E.D., H.D., A.E., G.D.S.B. and T.M. performed TR analyses in the HPRC samples. V.M.-C., T.D.B., P.J. and D.L.N. generated sequencing from prefrontal cortex samples of individuals with FMR1 expansions. M.A.E., F.J.S., A.R.Q., T.P. and S.Z. provided guidance and supervision. E.D., A.E., H.D., F.J.S. and M.A.E. wrote the manuscript, with assistance from C.K., K.P.C., W.J.R., Z.K., A.W. and A.R.Q. All authors read and approved the manuscript.
Ethics declarations
Competing interests
E.D., G.D.S.B., T.M., W.J.R., C.K., Z.K., K.P.C., A.W. and M.A.E. are employees and shareholders of Pacific Biosciences. F.J.S. received research support from Illumina, Pacific Biosciences, Nanopore and Genentech. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Ira Deveson, Melissa Gymrek and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Supplementary information
Supplementary Information
Supplementary Figs. 1–10, Supplementary Tables 1 and 2 and Supplementary Note
About this article
Cite this article
Dolzhenko, E., English, A., Dashnow, H. et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-02057-3
DOI: https://doi.org/10.1038/s41587-023-02057-3