Nucleic Acids Res. 2014 Oct;42(18):e144.
doi: 10.1093/nar/gku739. Epub 2014 Aug 12.

COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification

Daniel Struck et al. Nucleic Acids Res. 2014 Oct.

Abstract

Viral sequence classification has wide applications in clinical, epidemiological, structural and functional categorization studies. Most existing approaches rely on an initial alignment step followed by classification based on phylogenetic or statistical algorithms. Here we present an ultrafast alignment-free subtyping tool for human immunodeficiency virus type one (HIV-1) adapted from Prediction by Partial Matching compression. This tool, named COMET, was compared to the widely used phylogeny-based REGA and SCUEAL tools using synthetic and clinical HIV data sets (1,090,698 and 10,625 sequences, respectively). COMET's sensitivity and specificity were comparable to or higher than the two other subtyping tools on both data sets for known subtypes. COMET also excelled in detecting and identifying new recombinant forms, a frequent feature of the HIV epidemic. Runtime comparisons showed that COMET was almost as fast as USEARCH. This study demonstrates the advantages of alignment-free classification of viral sequences, which feature high rates of variation, recombination and insertions/deletions. COMET is free to use via an online interface.

Figures

Figure 1.
N-ary tree representation of a Markov model, with the context ‘CGT’ highlighted. Each node (circle) has an associated frequency table (box) over the next base in the sequence following the context.
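The tree in Figure 1 can be illustrated with a minimal sketch. This is not COMET's implementation: the class name `ContextNode`, the functions `build_context_tree` and `lookup`, the maximum context length of 3, and the convention that a context is read left to right from the root are all assumptions made for illustration. Each node stores a frequency table over the base that follows its context.

```python
from collections import defaultdict


class ContextNode:
    """One node of the context tree (hypothetical name): children plus a frequency table."""

    def __init__(self):
        self.children = {}                   # base -> ContextNode
        self.next_counts = defaultdict(int)  # base that follows this context -> count


def build_context_tree(sequence, max_order=3):
    """Count, for every context of length 0..max_order, which base follows it."""
    root = ContextNode()
    for i, next_base in enumerate(sequence):
        for order in range(max_order + 1):
            if i - order < 0:
                break  # not enough preceding bases for this context length
            node = root
            # Walk (and create) the path spelled by the context, left to right.
            for base in sequence[i - order:i]:
                node = node.children.setdefault(base, ContextNode())
            node.next_counts[next_base] += 1
    return root


def lookup(root, context):
    """Return the frequency table stored at the node for `context` (empty if unseen)."""
    node = root
    for base in context:
        node = node.children.get(base)
        if node is None:
            return {}
    return dict(node.next_counts)


tree = build_context_tree("ACGTACGTTCGTA")
print(lookup(tree, "CGT"))  # counts of the bases observed right after 'CGT'
```

The empty context at the root holds the plain base counts of the whole sequence, and longer contexts sit deeper in the tree, so a single walk from the root reaches the table for any context up to the chosen length.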
Figure 2.
Subtype decision tree. The row sums of the log-likelihood matrix give the overall likelihood that the query sequence belongs to each subtype. These sums are ranked to identify the most likely subtype (S) and the most likely pure subtype (PS). If the most likely subtype is a pure subtype (i.e. S = PS), its likelihood is challenged against that of every other subtype (PURE or CRF) by sliding a 100-bp window over the matrix in steps of 3 bp. If the difference between the row sums within the window remains below the recombination threshold (i.e. 28) in every window, the pure subtype is assigned. Otherwise, COMET returns the result ‘UNASSIGNED’. If the query sequence has the highest likelihood of being a CRF, COMET performs a similar challenge, but at first only against the most likely pure subtype (PS). If this difference remains below the recombination threshold (i.e. 28), COMET assigns the pure subtype (S) with an indication to check for the CRF, indicating a region where the CRF is pure. If the difference is higher than the recombination threshold, a second scan is performed as for the PURE situation, challenging each subtype against the initially assigned CRF.
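The PURE branch of the decision tree in Figure 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not COMET's code: the function names `window_sums` and `assign_pure` are hypothetical, the log-likelihood matrix is modelled as a dict mapping subtype name to a list of per-position scores, the "difference" is taken to be the challenger's window sum minus the top subtype's window sum, and the CRF branch is omitted. The window size (100 bp), step (3 bp) and threshold (28) are the values quoted in the caption.

```python
RECOMBINATION_THRESHOLD = 28  # value quoted in the Figure 2 caption
WINDOW = 100                  # window length in bp, from the caption
STEP = 3                      # step size in bp, from the caption


def window_sums(row, window=WINDOW, step=STEP):
    """Sum a row of per-position scores over sliding windows."""
    if len(row) <= window:
        return [sum(row)]
    return [sum(row[s:s + window]) for s in range(0, len(row) - window + 1, step)]


def assign_pure(loglik, threshold=RECOMBINATION_THRESHOLD):
    """Return the top-scoring subtype, or 'UNASSIGNED' if any window favours a challenger.

    Assumption: a window "fails" when a challenger's window sum exceeds the
    top subtype's window sum by at least `threshold`.
    """
    totals = {name: sum(row) for name, row in loglik.items()}  # row sums
    best = max(totals, key=totals.get)
    best_windows = window_sums(loglik[best])
    for name, row in loglik.items():
        if name == best:
            continue
        for best_sum, challenger_sum in zip(best_windows, window_sums(row)):
            if challenger_sum - best_sum >= threshold:
                return "UNASSIGNED"
    return best


# Toy data: 'B' wins everywhere in the first case, but loses the last
# third of the sequence to 'C' in the second case.
uniform = {"B": [-1.0] * 300, "C": [-2.0] * 300}
mosaic = {"B": [-1.0] * 200 + [-1.5] * 100, "C": [-2.0] * 200 + [-1.0] * 100}
print(assign_pure(uniform), assign_pure(mosaic))
```

Scanning by window rather than relying on the row sums alone is what lets a locally better-fitting challenger surface even when another subtype has the highest overall likelihood.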
Figure 3.
Sensitivities and specificities of COMET, REGAv2 and SCUEAL assessed using the synthetic variation data set spanning the pol region.
Figure 4.
Agreement between the three subtyping tools on the subtype assigned to clinical patient-derived sequences retrieved from the LANL database. This data set includes 10,625 sequences spanning pol.
Figure 5.
Sensitivity and specificity of COMET and USEARCH on clinical pol sequences. A data set of 105,752 clinical pol sequences from the LANL HIV database was used for this analysis. The data set includes all PURE and CRF sequences longer than 800 bp, represented by at least 50 sequences, but excludes URFs, which are not classifiable by USEARCH. The USEARCH database was built using the COMET training set. The recombination module of COMET was disabled for this analysis.
