Comparative Study
Sci Rep. 2018 May 1;8(1):6800.
doi: 10.1038/s41598-018-25090-8.

Comparative analyses of whole-genome protein sequences from multiple organisms

Makio Yokono et al. Sci Rep.

Abstract

Phylogenies based on entire genomes are a powerful tool for reconstructing the Tree of Life. Several methods have been proposed, most of which employ an alignment-free strategy. Average sequence similarity methods are different than most other whole-genome methods, because they are based on local alignments. However, previous average similarity methods fail to reconstruct a correct phylogeny when compared against other whole-genome trees. In this study, we developed a novel average sequence similarity method. Our method correctly reconstructs the phylogenetic tree of in silico evolved E. coli proteomes. We applied the method to reconstruct a whole-proteome phylogeny of 1,087 species from all three domains of life, Bacteria, Archaea, and Eucarya. Our tree was automatically reconstructed without any human decisions, such as the selection of organisms. The tree exhibits a concentric circle-like structure, indicating that all the organisms have similar total branch lengths from their common ancestor. Branching patterns of the members of each phylum of Bacteria and Archaea are largely consistent with previous reports. The topologies are largely consistent with those reconstructed by other methods. These results strongly suggest that this approach has sufficient taxonomic resolution and reliability to infer phylogeny, from phylum to strain, of a wide range of organisms.

Conflict of interest statement

M.Y. is an employee of Nippon Flour Mills Co., Ltd. Part of the calculation was performed on a computer lent free of charge by Hokkaido University to the Innovation Center of Nippon Flour Mills Co., Ltd.

Figures

Figure 1
(a) Artificial evolution of all open reading frames from Escherichia coli 536. Thirty-two genomes after the fifth generation were used to build the phylogenetic trees shown in b–f. (b,c) Reconstructed trees using all genes evolved in silico using our average sequence similarity method. (b) Distance matrix was constructed with D. (c) Distance matrix was constructed with C. (d–f) Reconstructed trees using a single gene. (d) Tree reconstructed from mutated ECP_0844 genes (404 amino acid length) using our average sequence similarity method developed in this study. (e) Similarity dendrogram constructed from mutated ECP_0844 genes, or (f) mutated ECP_0843 genes (97 amino acid length) using the multiple sequence alignment program ClustalX v. 2.1.
Figure 2
Phylogenetic tree of 1,087 species reconstructed from a comparison of all protein sequences from all the species. Branch colors reflect taxonomic information (division) obtained from the NCBI Website.
Figure 3
Smaller version of Fig. 2. Species within each clade were collapsed according to taxonomic information (division).
Figure 4
Phylogenetic tree of 116 species reconstructed from a comparison of all protein sequences using the Fitch-Margoliash method. The 116 species were selected by random, equal sampling of the most recent genomes from each of the three domains. The inset shows the same tree in a polar layout.
Figure 5
Representation of best-matched proteins on a two-dimensional display. The vertical axes represent the logarithmic E-values (E) of the best-matched proteins of Arabidopsis to the proteins of Chlamydomonas. The horizontal axes represent the logarithmic E-values of the best-matched proteins of Arabidopsis to Arabidopsis (Ebest). In this case, the best-matched proteins are identical to the query proteins. The plot was fitted with a straight line from the origin (red line). We estimated the average of T from the slope of the line.
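
The Figure 5 caption describes fitting a straight line constrained to pass through the origin and reading a quantity off its slope. The caption does not say which fitting procedure was used or how the slope maps to T, so the following is only a minimal sketch assuming ordinary least squares with a zero intercept. The function name `slope_through_origin` and the sample log E-values are hypothetical and are not taken from the paper.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope m of the model y = m * x (intercept fixed at zero)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("at least one x value must be non-zero")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Minimising sum((y - m*x)^2) over m gives m = sum(x*y) / sum(x*x).
    return sum_xy / sum_xx


# Hypothetical data, invented for illustration only.
# Horizontal axis: log E-value of each query protein against its own proteome (Ebest).
log_e_best = [-180.0, -120.0, -60.0, -30.0]
# Vertical axis: log E-value of the same query against the other proteome (E).
log_e_cross = [-90.0, -66.0, -27.0, -18.0]

slope = slope_through_origin(log_e_best, log_e_cross)
print(f"fitted slope: {slope:.4f}")
```

Fixing the intercept at zero matches the caption's "straight line from the origin"; an unconstrained fit with a free intercept would generally give a different slope.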
