Background & Summary

Insects are known to be associated with multiple symbionts for acquiring unique and beneficial functions1 during the 480 million years of evolutionary history2,3. Symbionts can help the host insect in better adapting to complex and dynamic ecological environments, influencing mating, reproduction, metabolism, and immunity of hosts2. These symbionts participate in many life activities of their host insects, for example, helping insects in resisting the invasion of pathogenic microorganisms and parasites, evading predators, developing resistance to insecticides, and synthesizing essential nutrients required by insects4. Furthermore, the composition and metabolic activities of symbionts are influenced by the selection and regulation of the host insects. Symbionts are considered as a unique “multifunctional organ” of insects and an indispensable component of insects. The study of insect symbionts has significant implications in biocontrol, interruption of vector-borne diseases, and prevention and control of insect pests5.

The phenomenon of genome reduction has been observed consistently across various obligate symbionts in insects, particularly within the suborder Sternorrhyncha. This symbiont is strictly dependent on the host, transmitted exclusively maternally, and has co-diversified with hosts for a significant duration, undergoing early genome loss yet often maintaining stability over time6,7. Examples include Carsonella ruddii in psyllids (158–166 kb)8, Portiera aleyrodidarum in whiteflies (281–358 kb)9, Tremblaya princeps (139–171 kb) and Moranella endobia (538 kb) in mealybugs10,11,12, as well as Buchnera aphidicola in aphids (with genome sizes ranging from 419 to 656 kb)13. The aphid-Buchnera system is a classic model to investigate insect-endosymbiont interaction. Aphids, feeding on phloem sap as their dietary source, face nutritional limitations due to the rich in simple sugars but unbalanced mixture of amino acids in sap14,15. Therefore, nearly all aphids rely on their specialized intracellular symbiotic bacteria, Buchnera aphidicola (Gammaproteobacteria), to provide them with essential amino acids, vitamins, and other important nutrients16. Buchnera is exclusively found within specialized bacteriocytes, which are symmetrical arrangements in the abdominal hemocoel of aphids17,18. Buchnera has been identified only in Aphididae species, and cannot survive independently outside the host aphids19. Therefore, expanding the dataset of Buchnera genomes is crucial for advancing our understanding of the evolutionary dynamics within these endosymbionts. The additional genomes will enable comprehensive analyses of genetic variation, adaptation, and co-evolutionary patterns, shedding light on fundamental aspects of endosymbionts and host-endosymbionts interactions.

The characteristics of clonality and maternal vertical transmission contribute to a faster fixation rate of slightly deleterious mutations in Buchnera than free-living relatives, which is reflected in the accelerated evolution of protein coding genes20, gene inactivation and loss21,22, leading to a significant reduction in genome size23,24. While the Buchnera have undergone genome reduction, most of them still retain the necessary genes related to the biosynthesis of essential amino acids required by aphids22,25,26. The genome size of Buchnera in the Lachninae is only from 422 kb to 458 kb, while in the Macrosiphini of Aphidinae ranges from 614 kb to 671 kb, reflecting the adaptive evolution between aphids and Buchnera at the genomic level13. Chong et al.27 analyzed 39 Buchnera genomes from 6 aphid subfamilies and found that the most recent common ancestor of Buchnera had at least 616 protein coding genes, which then experienced non-random gene loss in different lineages27.

With the continuous advancement of high-throughput sequencing technologies, the number of obligate symbiont genomes has increased. Currently, 77 Buchnera strains from 61 aphid species belonging to 10 subfamilies have been deposited in the GenBank database. However, the majority of these data are concentrated in Aphidinae, with 49 Buchnera strains from 33 species, accounting for approximately 63.6%. And the sizes of the published Buchnera genomes are mainly around 400 kb or 600 kb, which do not well represent the process of genome reduction. To thoroughly explore the diversity of Buchnera genomes across different aphid species, we sequenced 29 new genomes representing 19 aphid species of 11 subfamilies. This addition includes Buchnera genomes from four additional aphid subfamilies (Greenideinae, Mindarinae, Neophyllaphidinae and Taiwanaphidinae). To delve deeper into the reduction patterns of Buchnera genomes, a more comprehensive and reliable dataset is needed. Therefore, based on a quality control process, we constructed a robust dataset comprising 90 Buchnera genomes by combining the 29 newly sequenced genomes from 11 subfamilies and 61 selected genomes from the GenBank. The dataset we utilized demonstrates a more even distribution of Buchnera genome sizes, encompassing sizes such as 400 kb, 500 kb, and 600 kb. This diversity illustrates the ongoing process of genome reduction in Buchnera. In contrast, previous datasets predominantly featured Buchnera genome sizes concentrated around 400 kb and 600 kb. Additionally, our dataset includes 47 aphid species from a wider range of families, with primary representation from Aphidinae (600 kb) and Lachninae (400 kb). The extensive coverage of genome data can be effectively employed to validate the genome evolution of Buchnera within a phylogenetic framework, and contribute to understanding of the microevolutionary processes that shape genome reduction in insect obligate endosymbionts.

Methods

Sampling and DNA sequencing

Twenty-nine aphid samples from 11 subfamilies (Aphidinae, Calaphidinae, Chaitophorinae, Drepanosiphinae, Eriosomatinae, Greenideinae, Hormaphidinae, Lachninae, Mindarinae, Neophyllaphidinae, Phyllaphidinae, Taiwanaphidinae and Thelaxinae) were collected from May 2015 to June 2019 in various regions of China, including Fujian, Yunnan, Guangdong, Beijing, Zhejiang, Jiangxi, and Guangxi. All collected aphid specimens were identified to species level mainly based on morphological characters by experienced aphid taxonomists. Some samples from the same aphids originated from different geographical locations, spanning a considerable geographic range, evenly from south to north (Table S1). Each sample comprised of multiple individuals collected from the same aphid colony on a single host plant. Detailed sampling information is listed in the Supplemental Table S1. The specimens were kept in 95% ethanol and store at −20 °C after collection. Due to the different body size of aphid species, five to twenty apterous adult females of each sample were used for DNA extraction. After washing three times in ultrapure water, the genomic DNA extraction from the aforementioned pooled samples of each aphid species were performed using the DNeasy Blood and Tissue kit (QIAGEN), following the manufacturer’s manual.

All DNA samples were sent to Biomarker Technologies Co., Ltd. (China, Beijing) for metagenomic next-generation sequencing. The metagenomic library construction process involved fragmentation and purification of genomic DNA, followed by end repair and A-tailing. Subsequently, adapters were ligated to the fragments, and the resulting products were purified. PCR amplification was performed, and the resulting products were purified again. Finally, the library quality was assessed prior to downstream analysis. Following library construction, the library underwent Illumina (Illumina Corp., San Diego CA, USA) 2 × 150 paired-end sequencing using Illumina NovaSeq 6000 platform (Illumina Corp., San Diego CA, USA) with the NovaSeq 6000 S4 Reagent Kit (Illumina Corp., San Diego CA, USA). To ensure the quality of bioinformatics, raw reads were filtered to obtain clean reads. Trimmomatic28 software (parameters: LEADING: 3, TRAILING: 3, SLIDINGWINDOW: 50:20, MINLEN: 100) was used to filter raw tags and obtain high-quality sequencing data (clean tags). Finally, a minimum of 10 GB of sequence data (clean reads) was generated for each sample.

Genome assembly and annotation

Clean reads were utilized to assemble the genome of Buchnera. The assembly process began with the use of MEGAHIT v1.1.129, which generated numerous long contigs in a final assembly fasta document. Subsequently, 16S rRNA sequences of Buchnera from the same species of 29 aphids available on NCBI were downloaded. These 16S rRNA sequences served as query sequences to extract high-coverage Buchnera chromosome genomes from the final assembly sequences using BLASTN v2.5.030. The extracted sequences were then de novo assembled using Geneious v7.1.931. In cases where the genome was not circular, the extracted sequence served as a seed sequence for subsequent assembly in NOVOPlasty v3.532 (parameters: Insert size = 250, Read length = 150, Genome range = 40000–70000, K-mer = 33, 30, 27). Most of the Buchnera genomes were circularized through this process. For cases where circularization was not achieved, sequences from NOVOPlasty and MEGAHIT were combined and assembled together in Geneious. Finally, all Buchnera genomes were successfully circularized. The circular draft genome sequences were corrected by Pilon33 for bases correcting and mis-assemble fixing, based on the paired end data.

We additionally assembled the mitochondrial genomes (mitogenomes) of host aphids using the aforementioned methods. For this purpose, we utilized the cox1 sequences and complete mitogenome sequences of closely related species as seed sequences.The mitogenomes of Kurisakia onigurumii Mt23 & Mt24, Neophyllaphis podocarpi Mt25, and Neophyllaphis varicolor Mt26 were not circular and were represented by a long contig, respectively. However, we successfully assembled the mitogenomes of the remaining 25 host aphids into circular genomes.

Another 61 Buchnera genomes from the Genbank were selected and incorporated with our newly sequenced genomes into a comprehensive genome dataset of 90 Buchnera genomes. Among the 77 genomes accessible on NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9, 2022), we specifically chose 61 Buchnera genomes originating from diverse aphid hosts. In particular, only one Buchnera strain per aphid species was retained, typically favoring strains that had undergone detailed genome analysis and possessed the relative large genomes. Additionally, Buchnera genome sequences with high levels of degenerate bases (more than 5%) were not considered, despite species of this kind having only one Buchnera genome. For example, Hormaphis cornu isolate DLS_fromHcor80 (accession number: CP051840.1) has a total genome length of 643,231 bp, but it contains 60,999 degenerate bases “N”, accounting for 9.48% of the sequence. Such instances could potentially mislead future analyses related to genome structure and features. The names of all Buchnera strains are presented by the species name of their corresponding host aphids (Table 1). All complete circular genome were annotated through both the RAST Server v2.034 and Prokka v1.13.335. All genes that differed in length or location were manually curated. The gene names were standardized based on the Prokka annotation.

Table 1 Sources and genome characteristics (base composition, gene size, and gene number) of 90 Buchnera strains used in dataset construction.

Genome characteristics

The PhyloSuite v1.2.136 was used to extract the genomic feature information based on the well-annotated detailed files in GenBank format. Linear analysis was conducted to examine the relationship between genome size and GC content of the 90 Buchnera genomes. The genome size of Buchnera is from 395,344 bp (Buchnera from Greenidea ficicola 11) to 671,355 bp (Buchnera from Sitobion miscanthi), the average size is 531,263 bp (Table 1). The genome sizes of Buchnera across the Aphididae are larger than 600 kb, which representing the largest genome size observed among all Buchnera strains (Table 1). In contrast, the Buchnera strains from Hormaphidinae aphids display the smallest genome size (except Buchnera from Nipponaphis monzeni with 587,781 bp), only about 410 kb. Extreme genome reduction has been also confirmed in our newly sequenced Buchnera genomes, that of Hormaphidinae, Greenideinae, Chaitophorinae, Calaphidinae and Lachninae. And the Buchnera from Greenidea ficicola 11 (Greenideinae), which encodes only 363 proteins, is the smallest Buchnera genome by far (395,344 bp). The length and the number of different type genes in different Buchnera strains have been shown in Table 1. We can find that the number and length of tRNA, rRNA, tmRNA are relative stable in different Buchnera strains, but the genome length, CDS (coding sequences) length and count, and noncoding length were significantly different. The average percentage of noncoding regions of all Buchnera strains account for 12.5% of the total genome size. The GC content of all Buchnera is very low, from 18.3% (Buchnera from Periphyllus koelreuteriae 21, with genome size 452,078 bp) to 27% (Buchnera from Hyadaphis tataricae, with genome size 633,867 bp), and the average is 23.3% (Table 1 and Fig. 1). The number and size of different types of genes in the Buchnera genomes from all aphid subfamilies were extracted. Additionally, the size of non-coding regions was also extracted. The reduced genomes tend to have a smaller number of genes (Fig. 2). Although some Buchnera strains still have relatively large genome sizes (e.g., Buchnera from Nipponaphis monzeni and Anoecia oenotherae), their noncoding regions are larger (reaching 24.7% of the genome size) than others.

Fig. 1
figure 1

Relationship of genome size (kb) to GC content (%) in Buchnera genomes. Buchnera from Hyadaphis tataricae strain has the highest GC content (27%), Buchnera from Periphyllus koelreuteriae 21 has the lowest GC content (18.3%).

Fig. 2
figure 2

The length (indicated by asterisks) and number (indicated by circles) of various gene types in different Buchnera strains, as well as the length of the genome and non-coding regions. Different groups are represented by different colored lines. Various background colors represent Buchnera from different subfamilies of aphids, labeled with text in corresponding colors. The order of Buchnera strains from left to right corresponds to the order on the phylogenetic tree. CDS: coding sequence, mRNA: messenger RNA; tmRNA: transfer mRNA.

The COG categories classification of all Buchnera

The protein coding genes presenting in all Buchnera strains were classified into different functional categories based on previous study25. The two datasets of clusters of orthologous genes (COGs) of Buchnera aphidicola and Escherichia coli K-12 substr. MG1655 (https://www.ncbi.nlm.nih.gov/research/cog/) were downloaded for the COGs annotation of all Buchnera strains. All the genes were classified into different COG categories based on the two datasets. Functional categories of all Buchnera genomes was presented in Supplemental Table S2. All protein coding genes was conducted to identify orthologous gene clusters across all Buchnera strains. The clustering of 26 functional categories showed the distinct patterns of orthologous gene reduction during the evolution of Buchnera (Fig. 3 and Supplemental Fig. S1). But there was no gene grouped into the follow functional categories: B (Chromatin Structure and Dynamics), W (Extracellular Structures), Y (Nuclear Structure), and Z (Cytoskeleton). Nearly all COG categories have an ongoing reduction of orthologous gene clusters from Buchnera strains with big genome size to small genome size.

Fig. 3
figure 3

The gene clustering patterns of all Buchnera strains within different COG functional categories. Different colors represent different Buchnera strains within each category, with different subfamilies labeled in bold black text.

Data Records

The newly assembled genomes of 29 Buchnera strains (with strain names ending in numbers) are available through figshare37 as well as the National Center for Biotechnology Information (NCBI)38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66. The 29 mitogenomes of their host aphids, along with the comprehensive dataset based on 90 Buchnera genomes, can also be accessed through figshare37.

Technical Validation

Sequencing data quality control

In sequenced raw reads, low-quality sequences are present. To ensure the quality of bioinformatics analyses, the raw reads undergo filtering to obtain clean reads for subsequent bioinformatics analysis. Initially, Trimmomatic software is employed to filter raw tags and obtain high-quality sequencing data (clean tags). The clean reads data is obtained after the quality control of sequenced data.

Validation of genome assembly

We utilized NOVOPlasty v3.5 and MEGAHIT v1.1.1 for sequence assembly. The sequences from different software were aligned with other complete sequences of Buchnera. If the similarity is lower than 50%, these sequences will be excluded in the future genome assembly. The analysis suggested completeness ranging from 92.96% to 99.99%, with an average of 98.36%. The relatively lower completeness of some smaller genomes is due to the genome reduction in Buchnera, a widely recognized phenomenon.

Dataset quality control

The genome available in GenBank may not be complete or contain too many degenerate bases. Therefore, data cleaning is essential for construction of the comprehensive dataset and subsequent analyses. Initially, we conducted genome integrity assessments on all genomes downloaded from the GenBank database by CheckM2 v1.0.167. The genome with low completeness were removed, such as the Buchnera genome (CP033006) from Hormaphis cornu (completeness: 64.04%, coding density: 50%, genome size: 643,250 bp). The base composition was analyzed using BioEdit v7.0.5.368. Genomes with degenerate base content exceeding 1% of the total genome length were excluded. This step was crucial, as it may influence the subsequent analyses about genome length and the prediction of functional genes.