Genome Biol. 2014 Jun 26;15(6):R84.
doi: 10.1186/gb-2014-15-6-r84.

LUMPY: a probabilistic framework for structural variant discovery

Ryan M Layer et al. Genome Biol.

Abstract

Comprehensive discovery of structural variation (SV) from whole genome sequencing data requires multiple detection signals including read-pair, split-read, read-depth and prior knowledge. Owing to technical challenges, extant SV discovery algorithms either use one signal in isolation, or at best use two sequentially. We present LUMPY, a novel SV discovery framework that naturally integrates multiple SV signals jointly across multiple samples. We show that LUMPY yields improved sensitivity, especially when SV signal is reduced owing to either low coverage data or low intra-sample variant allele frequency. We also report a set of 4,564 validated breakpoints from the NA12878 human genome. LUMPY is available at https://github.com/arq5x/lumpy-sv

Figures

Figure 1
The LUMPY framework for integrating multiple structural variation signals. (A) A scenario in which LUMPY integrates three different sequence alignment signals (read-pair, split-read and read-depth) from a single genome sample. Additionally, sites of known variants are provided to LUMPY as prior knowledge in order to improve sensitivity. (B) A single signal type (in this case, read-pair) that is integrated from three different genome samples. We present these as example scenarios and emphasize that multi-signal and multi-sample workflows are not mutually exclusive. CNV, copy number variation.
Figure 2
Performance comparison using homozygous variants of various structural variation types. We simulated a genome with SVs by embedding 2,500 deletions, tandem duplications, inversions or translocations in random locations in the human reference genome. We then simulated sequence data from the altered genome with varying levels of sequence coverage. The performance measurements for LUMPY and DELLY were based on paired-end (pe) and split-read (sr) alignments, GASVPro considered pe and read-depth (rd), and Pindel considered sr alignments. (A) Sensitivity for each tool. LUMPY was the most sensitive in most cases, and had a marked improvement at lower coverage. DELLY detected three more translocations than LUMPY at 20X, at the expense of 93 more false positives. (B) The corresponding FDR for each tool. LUMPY’s FDR was low in all but the highest coverage cases. GASVPro and Pindel did not support tandem duplications, but false calls were made in some cases, which resulted in a 100% FDR. (C) The absolute number of false positive calls. LUMPY had a high number of false positives in some cases, but these are counterbalanced by a higher number of true positives (A). (D,E) To determine the impact that sequence alignment strategies had on SV detection accuracy, LUMPY’s sensitivity (D) and FDR (E) are shown when predicting deletions at 5X coverage via different alignment strategies from the simulations in (A-C). BWA-MEM produces both pe and sr alignment signals in a single alignment step, and serves as a basis of comparison to the NOVOALIGN (pe) and YAHA (sr) strategy. BWA-MEM provides better sensitivity than NOVOALIGN when using the pe signal alone, yet YAHA provides better sensitivity than BWA-MEM when using the sr signal alone. Sensitivity and FDR are roughly equivalent with either the BWA-MEM or NOVOALIGN/YAHA strategies when LUMPY integrates both alignment signals.
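The sensitivity and FDR values reported in this caption follow from simple counts of true and false positive calls. The sketch below is illustrative only: the function names are hypothetical, the counts in the usage lines are invented, and it assumes the standard definitions (sensitivity = TP / total true variants, FDR = FP / total calls), which the caption itself does not spell out.

```python
def sensitivity(true_positives: int, total_true_variants: int) -> float:
    """Fraction of the embedded (true) variants that a tool recovered."""
    if total_true_variants <= 0:
        raise ValueError("total_true_variants must be positive")
    return true_positives / total_true_variants


def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """Fraction of a tool's calls that are false; 0.0 when no calls were made."""
    total_calls = true_positives + false_positives
    if total_calls == 0:
        return 0.0
    return false_positives / total_calls


# Invented counts against the 2,500 simulated variants of one SV type.
example_sensitivity = sensitivity(2000, 2500)

# A tool that does not support an SV type but still emits a few calls
# has zero true positives, so every call is false and the FDR is 100%.
unsupported_type_fdr = false_discovery_rate(0, 3)
```

Under these definitions, any number of false calls with no true positives gives an FDR of 1.0, which matches the 100% FDR noted for unsupported tandem duplications.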
Figure 3
Receiver operating characteristic (ROC) curves comparing deletion prediction performance in the NA12878 individual. The relationship between true positive and false positive calls for deletions in the NA12878 genome is given for LUMPY, GASVPro, DELLY, and Pindel. Each point on a given tool’s ROC curve represents a minimum evidence support threshold ranging from 4 to 11 for 5X coverage and 4 to 20 for 50X coverage. Correctness was determined by two different methods: intersection with one of the 3,376 non-overlapping validated deletions from Mills et al. [12], or validation by PacBio/Moleculo data. (A,B) As in Figure 5, prediction performance was measured with both 5X mean genome coverage (A) and 50X coverage (B). The curves are colored following the same convention described in Figure 5. LUMPY outperforms all other tools in all but one case. Pindel slightly outperforms LUMPY at higher-evidence thresholds in the 5X coverage case considering the Mills et al. truth set; we note that this is expected given Pindel was used by the 1000 Genomes Project as one of the tools to define this truth set. At the lower coverage, LUMPY’s performance is boosted by the inclusion of either prior evidence or NA12878’s parental genomes, but the read-depth signal is too weak to offer any improvement. The distinction between tools at 50X coverage is low, but it is expected given the coverage and quality of the data. At higher coverage, LUMPY is able to provide a high-confidence call set when considering read-depth, but priors and parental genomes have little added benefit. pe, paired-end; rd, read-depth; sr, split-read.
Figure 4
Performance comparison for structural variations in a simulated heterogeneous tumor cell population. To measure SV detection performance in the case of a heterogeneous tumor sample, we created a mock tumor genome by embedding 5,516 non-overlapping deletions identified by the 1000 Genomes Project into the human reference genome (build 37). Sequence reads were simulated separately from both the ‘tumor’ genome and the unaltered reference genome. We then mixed reads from both genomes in varying proportions to obtain simulated datasets representing a tumor cell population with different SV allele frequencies. Sequencing coverage levels are shown above each plot, and SV allele frequencies are shown beneath each plot. (A) Sensitivity for detecting SVs at varying allele frequencies and coverage levels. In all cases, LUMPY was more sensitive than GASVPro, DELLY, and Pindel, and showed a marked improvement when the coverage of the ‘tumor’ genome was low owing to either low sequence coverage or low SV allele frequency. In general, to achieve the same level of sensitivity as LUMPY, the other tools required twice the evidence from the ‘tumor’ genome. pe, paired-end; rd, read-depth; sr, split-read. (B) The FDRs for each tool at varying allele frequencies and coverage levels. The FDR for LUMPY was better than all other tools in all cases, with a notable improvement at lower SV allele frequencies. (C) The change in sensitivity when considering two SV detection signals versus a single signal alone is shown for the three tools at 40X coverage and at different SV allele frequencies. At low SV allele frequencies (for example, 5%), LUMPY’s use of two signals (that is, pe + sr) has a super-additive effect on sensitivity relative to either signal alone (that is, pe or sr), whereas the sensitivity of GASVPro and DELLY was either unchanged or modestly improved with one signal versus two.
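The read-mixing design can be illustrated with a small calculation. This is a rough sketch rather than the authors' simulation code: the function name is hypothetical, and it assumes that the SV allele frequency equals the fraction of total coverage drawn from the 'tumor' genome.

```python
def mixture_coverage(total_coverage: float, sv_allele_frequency: float) -> tuple[float, float]:
    """Split a total coverage into (tumor, reference) coverage.

    Assumes the SV allele frequency is the fraction of reads that come
    from the altered 'tumor' genome.
    """
    if not 0.0 <= sv_allele_frequency <= 1.0:
        raise ValueError("sv_allele_frequency must be between 0 and 1")
    tumor = total_coverage * sv_allele_frequency
    return tumor, total_coverage - tumor


# At 40X total coverage and a 5% SV allele frequency, only about 2X of
# the coverage carries evidence for the embedded deletions.
tumor_cov, reference_cov = mixture_coverage(40.0, 0.05)
```

Seen this way, low total coverage and low allele frequency both reduce the same quantity, the coverage of the 'tumor' genome, which is why the caption treats them together.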
Figure 5
Performance comparison of deletion detection in high and low coverage Illumina sequencing data from NA12878. We analyzed an approximately 50X coverage dataset of the NA12878 genome from the Illumina Platinum Genomes dataset. We tested LUMPY’s performance under four different variant calling scenarios. First, ‘LUMPY (pe + sr)’ considered both paired-end (pe) and split-read (sr) alignments (using YAHA) from NA12878. Second, ‘LUMPY with prior’ considered pe and sr alignments as well as 1000 Genomes variants as prior evidence. Third, ‘LUMPY trio’ considered pe and sr alignments for NA12878 as well as alignments from her parents (NA12891 and NA12892). Lastly, ‘LUMPY with CNVnator’ integrated pe and sr alignments with copy number loss predictions made by CNVnator (read depth (rd)). DELLY considered pe and sr alignments, GASVPro considered pe alignments and rd, and Pindel considered sr alignments. Sensitivity and FDR were estimated using two truth sets: 3,376 non-overlapping validated deletions from Mills et al. [12], and 4,095 deletions that were predicted by at least one tool and validated by PacBio or Moleculo alignments. (A) SV detection sensitivity and FDR on a 5X coverage subsample of the original data. LUMPY pe + sr was more sensitive than both GASVPro and Pindel and had either an equivalent or better FDR. DELLY was more sensitive than LUMPY pe + sr, but also had a higher FDR. Prior evidence or parental genomes improved LUMPY sensitivity. Given the low coverage, the read-depth signal was weak and only a small number of CNVs clustered with paired-end or split-read calls. (B) SV detection sensitivity and FDR on the original 50X coverage data. LUMPY pe + sr, DELLY, and Pindel had similar sensitivity in the Mills et al. truth set, and in the PacBio/Moleculo truth set DELLY had the highest sensitivity and FDR. LUMPY pe + sr had the next best sensitivity and the lowest FDR.
Figure 6
Relationship between paired-end and split-read signals for the NA12878 callset. (A) Venn diagram showing the total number of calls identified by paired-end alignments alone (left), by split-read alignments alone (right), or by both (center). Shown are the total number of calls, the sensitivity, and the FDR. Sensitivity and FDR are calculated precisely as in Figure 5. (B) Scatter plots showing the relationship between the number of split-reads (y-axis) and paired-end reads (x-axis) that identify each SV breakpoint in the entire callset (left), the unvalidated SV calls (center) and the validated SV calls (right). The number of variants in each category and the R² values are shown above each plot. Note that one unvalidated call is not visible in these plots due to cropping; it was identified by 236 split reads and 0 paired-end reads.
Figure 7
Detection performance in the NA12878 individual when restricting false discovery rates. We compared the performance of each tool in terms of sensitivity and novel variant discovery ability when considering only the subset of calls that meet a maximum FDR threshold. Using the results given in Figure 6, each tool’s FDR was calculated for each of the minimum-evidence settings used to generate the respective receiver operating characteristic (ROC) curves. This provided a mapping from the maximum FDR to the subset of calls that meet the associated minimum-evidence threshold for each tool. Sensitivity and FDR were estimated using the 4,095 deletions that were predicted by at least one tool and validated by PacBio or Moleculo alignments. (A) Sensitivity given a maximum FDR threshold. At 5X coverage, an FDR threshold of approximately 10% is achieved with a minimum of four alignments for LUMPY (8.1% FDR), four for GASVPro (10.1% FDR), six for DELLY (11.3% FDR), and nine for Pindel (6.3% FDR). An approximately 20% FDR at 50X coverage requires 8 alignments for LUMPY (18% FDR), 16 for GASVPro (19% FDR), 12 for DELLY (17.6% FDR), and 20 for Pindel (18.8% FDR). LUMPY had the highest sensitivity at both coverage levels and the relative improvement was most substantial at lower coverage. (B) Venn diagrams reflecting the absolute number of variants discovered uniquely and jointly among the different tools at both 10% FDR for 5X and 20% FDR for 50X. In both cases LUMPY found the largest number of unique variants. The difference was most dramatic in the 5X coverage experiment, where only 46 out of 665 (6.9%) of the variants found among all four tools were missed by LUMPY. pe, paired-end; rd, read-depth; sr, split-read.
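The mapping from a maximum FDR to a minimum-evidence threshold can be sketched as a lookup over per-threshold call counts. The function name and all counts below are hypothetical and are not taken from the paper. The sketch also assumes a strict cutoff, whereas the caption describes thresholds that only approximately meet the 10% or 20% target.

```python
def min_evidence_for_fdr(counts_by_threshold: dict[int, tuple[int, int]], max_fdr: float):
    """Return the lowest minimum-evidence threshold whose FDR is <= max_fdr.

    counts_by_threshold maps a minimum-evidence threshold to a pair
    (true_positives, false_positives) for the calls that pass it.
    Returns None when no threshold satisfies the limit.
    """
    for threshold in sorted(counts_by_threshold):
        true_positives, false_positives = counts_by_threshold[threshold]
        total_calls = true_positives + false_positives
        if total_calls and false_positives / total_calls <= max_fdr:
            return threshold
    return None


# Made-up counts for one tool: raising the threshold removes more false
# calls than true calls, so the FDR falls from 25% to about 15% to about 9%.
hypothetical_counts = {4: (600, 200), 5: (550, 100), 6: (500, 50)}

threshold_at_10_percent = min_evidence_for_fdr(hypothetical_counts, 0.10)
threshold_at_20_percent = min_evidence_for_fdr(hypothetical_counts, 0.20)
```

Choosing the lowest qualifying threshold keeps as many calls as possible under the FDR limit, so it also maximizes sensitivity at that limit.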
Figure 8
Breakpoint interval size distributions for structural variation calls in NA12878. LUMPY refines the location of a given breakpoint by taking the product of the probability distributions in the breakpoint’s evidence set. The shape of each distribution depends on the breakpoint uncertainty that is inherent to the evidence signal type (for example, the spatial uncertainty of breakpoints predicted by paired-end alignments is much higher than with split-read alignments). (A) The distribution of predicted breakpoint intervals for SV calls when using solely paired-end alignments. The variability in fragment size causes a significant amount of uncertainty in the paired-end signal, which results in a wide (over 500 bases for the NA12878 sample) distribution in the predicted breakpoint intervals. (B) The distribution of predicted breakpoint intervals for SV calls when using solely split-read alignments. Split-read alignments inherently have far less uncertainty in the predicted breakpoint location and, therefore, they yield a distribution with much lower variance. (C) The resulting breakpoint uncertainty distribution when both paired-end and split-read alignments are jointly considered. By taking the product of the distributions, the inherent breakpoint precision afforded by split-read alignments is not substantially diluted by paired-end alignments. (D) A comparison of the predicted breakpoint intervals reported by GASVPro (left), all LUMPY calls (center), and the 95% confidence interval for the LUMPY calls (right). Size distributions are not shown for DELLY or Pindel since they only report single base coordinates. stdev, standard deviation.
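A toy example shows why multiplying the evidence distributions preserves split-read precision. This is a minimal sketch, not LUMPY's implementation: the function name is hypothetical, the distributions are invented, and it assumes every piece of evidence is a discrete probability vector over the same window of genomic positions.

```python
def combine_evidence(distributions: list[list[float]]) -> list[float]:
    """Multiply per-position probabilities element-wise and renormalize."""
    if not distributions:
        raise ValueError("at least one distribution is required")
    length = len(distributions[0])
    product = [1.0] * length
    for dist in distributions:
        if len(dist) != length:
            raise ValueError("all distributions must cover the same positions")
        product = [p * q for p, q in zip(product, dist)]
    total = sum(product)
    if total == 0.0:
        raise ValueError("the distributions do not overlap")
    return [p / total for p in product]


# Invented 10-position window: the paired-end evidence is broad (uniform),
# while the split-read evidence is concentrated on three positions.
paired_end = [0.1] * 10
split_read = [0.0, 0.0, 0.0, 0.0, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0]

combined = combine_evidence([paired_end, split_read])
most_likely_position = combined.index(max(combined))
```

In this toy case the broad paired-end distribution is flat over the window, so the combined distribution keeps the narrow shape of the split-read evidence, in line with panel (C).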

References

    1. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12:363–376.
    2. Medvedev P, Stanciu M, Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009;6:S13–S20.
    3. Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–i339.
    4. Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012;13:R22.
    5. Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet. 2011;43:269–276.
