Bioinformatics. 2011 Nov 1;27(21):2957-63.
doi: 10.1093/bioinformatics/btr507. Epub 2011 Sep 7.

FLASH: fast length adjustment of short reads to improve genome assemblies

Tanja Magoč et al. Bioinformatics. 2011.

Abstract

Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
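
The overlap condition above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming both reads have the same length and the fragment is at least as long as one read; the function name and the example numbers are hypothetical and do not come from the paper.

```python
def expected_overlap(read_len: int, fragment_len: int) -> int:
    """Number of bases shared by two reads sequenced from opposite
    ends of one fragment (assumes fragment_len >= read_len)."""
    if fragment_len < read_len:
        raise ValueError("sketch assumes the fragment is at least one read long")
    # Two reads of length read_len cover 2 * read_len bases in total;
    # whatever exceeds the fragment length must be shared by both reads.
    return max(0, 2 * read_len - fragment_len)


# Illustrative numbers only: 100 bp reads from a 180 bp fragment.
print(expected_overlap(100, 180))  # 20 shared bases, so the merged read spans 180 bp
print(expected_overlap(100, 250))  # 0: the fragment is too long for the reads to meet
```

The reads overlap only when the fragment is shorter than twice the read length, and in that case the merged read is as long as the fragment itself.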

Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
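
For readers unfamiliar with the N50 statistic mentioned above, here is a minimal sketch of one common definition, based on total assembly length. Whether the paper uses this definition or a variant based on genome size is not stated in the abstract, so treat this as an assumption; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Made-up contig lengths: total is 290, and 80 + 70 = 150 covers half.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A larger N50 means that more of the assembly sits in long contigs or scaffolds.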

Availability and implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash.

Contact: t.magoc@gmail.com.

Figures

Fig. 1.
Distribution of fragment lengths. The horizontal axis shows the fragment length, and the vertical axis shows the number of fragments of a given length. (a) Staphylococcus aureus. (b) Human chromosome 14.
Fig. 2.
Distribution of fragment lengths used to generate paired-end reads. The horizontal axis shows the fragment length, and the vertical axis shows the number of fragments of a given length.
Fig. 3.
Number of errors in the 1% error rate sample for each position in a read. The horizontal axis shows the read position, and the vertical axis shows the total number of errors at each position summed over the entire set of 1 000 000 pairs.
Fig. 4.
Possible outcomes of FLASH for a pair of reads from opposite ends of the same fragment. The two reads are shown in white and black, and the grey region represents their overlap. For overlapping reads, FLASH can merge the pair correctly as shown at the top, or it can fail in two ways: either by failing to merge them or by creating the wrong length overlap. If the reads do not overlap, the correct output will leave them unchanged (a ‘non-merge’).
Fig. 5.
Impact of the mismatch ratio parameter on correctness of the read merging algorithm. The horizontal axis shows the number of incorrectly merged read pairs, and the vertical axis shows the number of correctly merged read pairs. The mismatch ratio parameter is shown at each point along the graph.
Fig. 6.
Impact of the minimum overlap parameter on correctness of the read merging algorithm. The horizontal axis shows the number of incorrectly merged read pairs, and the vertical axis shows the number of correctly merged read pairs. The minimum overlap value (in base pairs) is shown at each point on the graph.
Fig. 7.
Illustration of how exact tandem repeats might be collapsed. A and B represent unique sequences flanking R, which is a tandem repeat. On the left (a), R contains multiple identical copies of the same subsequence. At the top (i) is the original fragment, and just below that (ii) are the two overlapping reads sequenced from each end. The best overlap on the left (iii) shows that the reads overlap too much, which collapses R, eliminating one or more copies of the repeat (iv). On the right (b), the copies are not identical. D is a sequence (as short as one base) that makes one tandem copy different from the others. As a result, the best overlap (iii) produces the correctly merged reads (iv).
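
The captions for Figs. 5 to 7 refer to a minimum overlap, a mismatch ratio, and the collapse of exact tandem repeats. The sketch below shows how these ideas could fit together. It is not FLASH's implementation: the function names, the default parameter values, the tie-breaking rule (longer overlap wins) and the choice to copy overlapping bases from the first read are all assumptions made for illustration, and quality scores are ignored.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair by its best overlap, or return None if no
    overlap qualifies. Defaults are illustrative assumptions."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    mate = reverse_complement(read2)  # put both reads on the same strand
    best = None  # (mismatch ratio, overlap length)
    # Try every overlap from longest to shortest. The strict '<' below
    # means the longer overlap wins ties (an assumption of this sketch).
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-overlap:], mate[:overlap]))
        ratio = mismatches / overlap
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, overlap)
    if best is None:
        return None  # a 'non-merge'
    # Overlapping bases are copied from read1 (a simplification).
    return read1 + mate[best[1]:]


# Hypothetical flanks A and B around a tandem repeat of "GAT".
A, B = "ACGTTGCAAC", "CCATGGTACG"

exact = A + "GAT" * 6 + B  # case (a): identical copies, 38 bp fragment
merged = merge_pair(exact[:25], reverse_complement(exact[-25:]))
print(len(exact), len(merged))  # 38 35: one repeat copy was collapsed

varied = A + "GAT" * 3 + "GTT" + "GAT" * 2 + B  # case (b): one copy differs
merged = merge_pair(varied[:25], reverse_complement(varied[-25:]))
print(merged == varied)  # True: the distinct copy anchors the overlap
```

In case (a) a 15-base overlap matches as perfectly as the true 12-base overlap, so the sketch picks the longer one and loses a repeat copy. In case (b) the single differing base introduces mismatches into every wrong alignment, so the true overlap has the lowest mismatch ratio.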
