Article
Published: 28 February 2022

Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

Nature Biotechnology volume 40, pages 1075–1081 (2022)Cite this article

8798 Accesses
31 Citations
138 Altmetric
Metrics details

Subjects

Genome assembly algorithms

Abstract

Although most existing genome assemblers are based on de Bruijn graphs, the construction of these graphs for large genomes and large k-mer sizes has remained elusive. This algorithmic challenge has become particularly pressing with the emergence of long, high-fidelity (HiFi) reads that have been recently used to generate a semi-manual telomere-to-telomere assembly of the human genome. To enable automated assemblies of long, HiFi reads, we present the La Jolla Assembler (LJA), a fast algorithm using the Bloom filter, sparse de Bruijn graphs and disjointig generation. LJA reduces the error rate in HiFi reads by three orders of magnitude, constructs the de Bruijn graph for large genomes and large k-mer sizes and transforms it into a multiplex de Bruijn graph with varying k-mer sizes. Compared to state-of-the-art assemblers, our algorithm not only achieves five-fold fewer misassemblies but also generates more contiguous assemblies. We demonstrate the utility of LJA via the automated assembly of a human genome that completely assembled six chromosomes.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

Fig. 3: Chromosome-by chromosome LGA95 (left) and completeness (center) metrics as well as centromere-by-centromere LGA95 metric (right) for LJA (blue), hifiasm (green) and HiCanu (red) assemblies of the T2T read-set.

**Fig. 4: Constructing a multiplex de Bruijn graph.**

Efficient hybrid de novo assembly of human genomes with WENGAN

Article Open access 14 December 2020

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Article Open access 25 November 2019

Linear time complexity de novo long read genome assembly with GoldRush

Article Open access 22 May 2023

Data availability

All assemblies generated by LJA are available at https://zenodo.org/record/5552696#.YV3MkVNBxH4. All described datasets are publicly available through the corresponding repositories. All HiFi data were obtained from the National Center for Biotechnology Information Sequence Read Archive (SRA). The SRA access codes for all datasets are specified in Supplementary Note 2 ‘Information about datasets’. The CHM13 reference (version 0.9) generated by the T2T consortium (referred to as T2TGenome) can be found at https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.9.fasta.gz.

Code availability

The LJA code uses the open-source libraries spoa (version 4.0.5) and ksw2 (version 4e0a1cc) and is available at https://github.com/AntonBankevich/LJA. All software tools used in the analysis and their versions and parameters are specified in the text of the paper and in Supplementary Note 9 ‘LJA parameters’.

References

Nurk, S. et al. The complete sequence of a human genome. bioRxiv https://doi.org/10.1101/2021.05.26.445798 (2021).
Miller, D. E. et al. Targeted long-read sequencing identifies missing disease-causing variation. Am. J. Hum. Genet. 108, 1436–1449 (2021).
Article CAS Google Scholar
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Article CAS Google Scholar
Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29, 987–991 (2011).
Article CAS Google Scholar
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Article CAS Google Scholar
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Article CAS Google Scholar
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Article CAS Google Scholar
Cheng, H. et al. Haplotype-resolved de novo assembly with phased assembly graphs. Nat. Methods 18, 170–175 (2021).
Article CAS Google Scholar
Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).
CAS PubMed Google Scholar
Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Article CAS Google Scholar
Pevzner, P., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).
Article CAS Google Scholar
Ye, C., Ma, Z. S., Cannon, C. H., Pop, M. & Yu, D. W. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics 13, S1 (2012).
Article Google Scholar
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
Article CAS Google Scholar
Rautiainen, M. & Marschall, T. MBG: minimizer-based sparse de Bruijn graph construction. Bioinformatics 37, 2476–2478 (2021).
Article CAS Google Scholar
Bloom, B. H. Space/time tradeoffs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
Article Google Scholar
Karp, R. M. & Rabin, M. O. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31, 249–260 (1987).
Article Google Scholar
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540 (2019).
Article CAS Google Scholar
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Preprint at https://www.biorxiv.org/content/10.1101/2021.07.02.450803v1 (2021).
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
Article CAS Google Scholar
Chikhi, R. & Rizk, G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8, 22 (2013).
Article Google Scholar
Minkin, I., Pham, S. & Medvedev, P. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics 33, 4024–4032 (2017).
CAS PubMed Google Scholar
Bzikadze, A. V. & Pevzner, P. A. Automated assembly of centromeres from ultra-long error-prone reads. Nat. Biotechnol. 38, 1309–1316 (2020).
Article CAS Google Scholar
Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA—a practical iterative de Bruijn graph de novo assembler. in Research in Computational Molecular Biology. RECOMB 2010. Lecture Notes in Computer Science Vol. 6044, 426–440 (2010).
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D. & Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
Article CAS Google Scholar
Garg, S. et al. Chromosome-scale, haplotype-resolved assembly of human genomes. Nat. Biotechnol. 39, 309–312 (2021).
Article CAS Google Scholar
Idury, R. M. & Waterman, M. S. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2, 291–306 (1995).
Article CAS Google Scholar
Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001).
Article CAS Google Scholar
Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Article CAS Google Scholar
Chin, C. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Article CAS Google Scholar
Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
Article CAS Google Scholar
Roberts, R. J., Carneiro, M. O. & Schatz, M. C. The advantages of SMRT sequencing. Genome Biol. 14, 405 (2013).
Article Google Scholar
Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods 17, 155–158 (2020).
Article CAS Google Scholar
Mikheenko, A., Valin, G., Prjibelski, A., Saveliev, V. & Gurevich, A. Icarus: visualizer for de novo assembly evaluation. Bioinformatics 32, 3321–3323 (2016).
Article CAS Google Scholar
Hon, T. et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7, 399 (2020).
Article CAS Google Scholar
Tvedte, E. S. et al. Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes. G3 (Bethesda) 11, jkab083 (2021).
Article CAS Google Scholar

Download references

Acknowledgements

A. Bankevich, A. Bzikadze and P.A.P. were supported by National Science Foundation EAGER award 2032783. D.A. was supported by Saint Petersburg State University (grant ID PURE 73023672). We thank A. Korobeynikov and A. Mikheenko for useful suggestions on assembling HiFi reads using SPAdes and benchmarking.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of California, San Diego, San Diego CA, USA
Anton Bankevich & Pavel A. Pevzner
Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego CA, USA
Andrey V. Bzikadze
Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz CA, USA
Mikhail Kolmogorov
Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
Dmitry Antipov

Authors

Anton Bankevich
View author publications
You can also search for this author in PubMed Google Scholar
Andrey V. Bzikadze
View author publications
You can also search for this author in PubMed Google Scholar
Mikhail Kolmogorov
View author publications
You can also search for this author in PubMed Google Scholar
Dmitry Antipov
View author publications
You can also search for this author in PubMed Google Scholar
Pavel A. Pevzner
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors contributed to developing the LJA algorithms and writing the paper. A. Bankevich (jumboDBG and mowerDBG), A. Bzikadze (multiplexDBG) and D.A. (LJApolish) implemented the LJA algorithm. A. Bankevich benchmarked LJA and other assembly tools. A. Bankevich and P.A.P. directed the work.

Corresponding authors

Correspondence to Anton Bankevich or Pavel A. Pevzner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Ergude Bao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–15

Reporting Summary

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bankevich, A., Bzikadze, A.V., Kolmogorov, M. et al. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nat Biotechnol 40, 1075–1081 (2022). https://doi.org/10.1038/s41587-022-01220-6

Download citation

Received: 27 September 2021
Accepted: 11 January 2022
Published: 28 February 2022
Issue Date: July 2022
DOI: https://doi.org/10.1038/s41587-022-01220-6

This article is cited by

Space-efficient computation of k-mer dictionaries for large values of k
- Diego Díaz-Domínguez
- Miika Leinonen
- Leena Salmela
Algorithms for Molecular Biology (2024)
Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph
- Haoyu Cheng
- Mobin Asri
- Heng Li
Nature Methods (2024)
Genome assembly in the telomere-to-telomere era
- Heng Li
- Richard Durbin
Nature Reviews Genetics (2024)
De novo diploid genome assembly using long noisy reads
- Fan Nie
- Peng Ni
- Jianxin Wang
Nature Communications (2024)
Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time
- Sebastian Schmidt
- Jarno N. Alanko
Algorithms for Molecular Biology (2023)

Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads

Subjects

Abstract

Access options

Similar content being viewed by others

Efficient hybrid de novo assembly of human genomes with WENGAN

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

Linear time complexity de novo long read genome assembly with GoldRush

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Supplementary Information

Reporting Summary

Rights and permissions

About this article

Cite this article

This article is cited by

Space-efficient computation of k-mer dictionaries for large values of k

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Genome assembly in the telomere-to-telomere era

De novo diploid genome assembly using long noisy reads

Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time

Search

Quick links

Subjects

Abstract

Access options

Similar content being viewed by others

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links