Introduction

Metabolism is defined as the whole set of chemical reactions that take place in organisms1. In particular, metabolic pathways represent chemical transformations where a substrate becomes a product, typically with the aid of other molecules such as cofactors2,3. Recent advances in sequencing technologies have significantly increased the coverage of metabolic pathways in dozens of organisms. Much effort has been done to integrate these metabolic pathway into genome-scale metabolic models (GEMMs)4, which aim to accurately define the stoichiometry of all reactions in a particular organism, their associated genes, enzymes or transporters, their compartment localisation and other relevant biological information. GEMMs allow us to analyse the metabolic capabilities of both unicellular and multicellular systems with computational tools developed in the field of constraint-based modelling5. Despite these advances, the understanding of cellular metabolism in many organisms is still incomplete, with significant gaps and metabolites that have no links to any reaction in available metabolic models6,7. For example, even in the well-annotated KEGG database8, there remains 10,000 metabolites that are still not linked to a known biochemical reaction6.

Different computational tools have been developed to fill in metabolic gaps by means of improving the functional annotation of enzymes. The number of annotated enzymes across databases is much lower than that of metabolic reactions, which could be pointing towards enzymes carrying out more than one biochemical reaction. In KEGG, while we have 11,822 reactions, we can only find 8012 enzymes (https://www.kegg.jp/kegg/docs/statistics.html). The same pattern is found in BRENDA9, where the numbers are 21,665 and 8332, respectively. Similarly, in AGORA10, a repository of metabolic networks that include 818 organisms from the human gut microbiota, we found a total count of 1438.1 reactions on average per organism. In the case of enzymes, considering the genomic annotation from GenBank11 and Ensembl12, we obtained a total of 900.7 enzymes on average per organism in AGORA. Considering this evidence, annotating enzyme promiscuity seems a promising strategy to improve metabolic networks.

Promiscuous activity of enzymes lies on their capacity to bind to non-canonical substrates and catalyse novel reactions13,14. Annotating such capabilities provides the possibility to understand underground metabolism, which is not represented in current databases15,16. The growing relevance on enzyme promiscuity in different fields has led to the development of a number of algorithms and computational methods6,17,18,19. In brief, these algorithms rely on a set of generic enzymatic reaction rules, which define the chemical transformations that occur to a substrate, describing its reactive site and atomic rearrangement as a result of the reaction6,20. These reactions rules describe an abstraction of known reactions and permit a certain degree of flexibility of the substrates involved21, potentially leading to new reactions and products.

Several methods use manually curated reaction rules22,23,24,25. Although these rules integrate the best knowledge about enzymes, they are limited to a reduced number of reactions21. For this reason, the development of computational tools able to automatically extract reaction rules from known transformations has received much attention. Considerable progress has been made in recent years19,21,26,27. RetroRules26 provides thousands of rules that are extracted from public databases and constitutes the core of different algorithms for predicting novel metabolic pathways, such as RetroPath RL20, the latest version of a series of works developed by the same authors20,26,28,29.

In a previous effort to elucidate the metabolism of phenolic compounds in the human gut microbiota, we applied RetroPath RL to predict phenolic degradation pathways in the human gut microbiota. As more than 2/3 of phenolic compounds in the Phenol Explorer database30 are not included in universal metabolic databases, it was necessary to employ computational approaches to uncover microbial phenol metabolism. Despite identifying degradation pathways for 80 compounds in the Phenol Explorer database that were not present in previous gut microbiota reconstructions, RetroPath RL could not find candidate pathways for 180 out of 372 of phenolic compounds in the Phenol Explorer database. Continuing this early work, we focus here on PROXIMAL27, an enzyme promiscuity algorithm that follows a different strategy to build reaction rules and, thus, could potentially complement the results obtained with RetroPath RL.

The PROXIMAL algorithm has been successfully applied to create extended metabolic models in different organisms and to annotate cellular products7,31. PROXIMAL makes use of the KEGG database to predict possible transformations8. In particular, it is based on RPAIRS32, a database available in KEGG that provides the necessary alignment between paired substrates and products to define the modified sub-structures. The main limitation of this algorithm is that it cannot be used for transformations not included in KEGG. Moreover, RPAIRS was discontinued in 2016 (https://www.genome.jp/kegg/kegg1a.html), which hampers the application of PROXIMAL to more recent updated versions of KEGG, that is in continuous development. Overall, these limitations restrict the application of PROXIMAL to our problem of phenolic compound degradation in the human gut microbiota, since we rely on AGREDA33, a metabolic reconstruction that contains relevant reactions not included in KEGG.

To address these issues, we present here a completely new pipeline, called PROXIMAL2, which overcomes the dependency on KEGG data. Moreover, PROXIMAL2 extends the previous methodology for the automatic reaction rule generation, which was unable to correctly model complex reactions involved in the phenolic compound metabolism. In particular, PROXIMAL2 introduces two relevant improvements with respect to the former version: i) correct treatment of multi-step reactions and ii) tracking of electric charges in the transformations. We show that PROXIMAL2 substantially extends the chemical space of PROXIMAL, and it correctly generates a higher number of reaction rules in KEGG. We also present the application of PROXIMAL2 to predict degradation pathways of phenolic compounds in the human gut microbiota and compare the results with RetroPath RL.

Results

PROXIMAL2 is a rule-based method for the prediction of metabolic products. As in the previous version of the algorithm, PROXIMAL27, PROXIMAL2 defines the set of reaction rules as look-up tables derived from substrate-product pairs. These look-up tables comprise 2 parts: ‘key’ and ‘value’. In brief, key tables specify the modified substrate structure, particularly defining the reaction centre, atom where the chemical transformation occurs; and value tables describe the modifications resulting in the product. Once these look-up tables are defined for every substrate-product pair, PROXIMAL searches for subgraphs in the query compound matching with key tables and applies the associated transformation defined in the value tables in order to generate putative products.

As detailed in Methods section, for the automatic generation of reaction rules, PROXIMAL2 implements an open chemoinformatic strategy, mainly based on the Python package RDKit34, which overcomes the dependency of PROXIMAL on the discontinued database RPAIRS and reactions in KEGG. In particular, for the identification of reaction centres, a crucial step in the definition of look-up tables, PROXIMAL2 conducts the necessary atomic alignment of substrates and products via the Maximum Common Substructure algorithm available in RDKit, which additionally allows us to deal with compounds involving rings and charged atoms, features of particular interest in phenolic compound metabolism. Moreover, PROXIMAL2 extends the scope of PROXIMAL, which is limited to reactions with a single reaction centre, making possible to deal with complex transformations involving multiple reaction centres that are typically found in phenolic compound metabolism. Full details of the PROXIMAL2 pipeline can be found in Methods section.

To illustrate the improvement of PROXIMAL2 over PROXIMAL in the automatic generation of reaction rules, we present below a side-by-side comparison between both approaches. Then, we apply PROXIMAL2 to predict degradation pathways of phenolic compounds in the human gut microbiota and compare the results obtained with a different enzyme promiscuity approach, RetroPath RL.

Comparison between PROXIMAL and PROXIMAL2

We first evaluated the accuracy of PROXIMAL2 in recovering annotated products from substrates in KEGG reactions, in comparison with the previous PROXIMAL algorithm (https://hassounlab.cs.tufts.edu/proximal/). Therefore, we applied PROXIMAL2 to the same set of KEGG reactions used by PROXIMAL. This subset of reactions involves 8819 reactions and 4983 associated metabolites (Supplementary Tables 1 and 2). In brief, the validation strategy followed here consists of generating the key and value tables for every substrate (S)-product (P) pair, following the methodology presented in the Methods section, and then assess if the product P is obtained when the same substrate S is applied as a query compound to their associated key and value tables (Supplementary Fig. 1). Rules that satisfy this requirement correctly captures the underlying chemistry of their associated reactions. We compared the accuracy of PROXIMAL2 with PROXIMAL under 3 different scenarios (see Fig. 1a).

Fig. 1: Comparison between PROXIMAL and PROXIMAL2.
figure 1

a Percentage accuracy in recovering annotated products in KEGG for PROXIMAL and PROXIMAL2 in three different scenarios: Case1, Case2 and Case3. In Case1 and Case2, we consider reactions with a single reaction centre. We do not consider atom charge information in Case1, but we do in Case2. In Case3, we consider all the reactions in the KEGG version used in PROXIMAL. The number of reactions used in each of the cases is shown in parenthesis. b Heatmap representing the chemical similarity of the predicted and annotated products in KEGG in Case 1 and 2 where at least one of the two algorithms fail to predict the annotated product. c Barplot representing the reaction coverage with look-up tables and those correctly predicting the annotated product in KEGG reactions. The y- axis shows the coverage in percentage. The total number of reactions are indicated over the bars.

First, we analysed the performance of PROXIMAL2 with the same limitations as in PROXIMAL. In particular, we only considered those reactions with a single reaction centre, reducing the study to 5991 reactions, and neglected the atom charge information from the predictions of PROXIMAL2. Under this scenario (Case 1 in Fig. 1a), PROXIMAL2 was able to generate the proper product for 5574 out of 5991 reactions (accuracy: 93%), while the predictions were correct for 5290 reactions in PROXIMAL (accuracy: 88.3%). This result illustrates that our chemoinformatic strategy in PROXIMAL2 (Steps 1–3), which overcomes the dependence on RPAIRS, produces more accurate results than PROXIMAL. In order to evaluate the effect of considering atom charge in PROXIMAL2, we updated the previous comparison and included this feature in our analysis. We now reached the proper product for 5814 reactions, obtaining an accuracy of 97% (Case 2 in Fig. 1a). This shows that the effect of atom charge in PROXIMAL2 further increase the accuracy of PROXIMAL2. Moreover, these results imply a relevant reduction of false positives in PROXIMAL2 with respect to PROXIMAL, since the number of incorrect predictions of annotated products is reduced by 8.7%.

To visualise the differences between PROXIMAL and PROXIMAL2 in Case 2, we extracted the molecules for which at least one of the two algorithms were not able to produce the correct product and computed the chemical similarity values between the predicted and annotated product (Fig. 1b). It can be observed that PROXIMAL2 is substantially more accurate than PROXIMAL, finding the annotated product (chemical similarity = 1) in many more cases. Note here that in both cases, Case 1 and Case 2, the improvement of PROXIMAL2 is highly significant (two proportions z-test p-value ≤ 2e−16).

In the last scenario, we included multi-centre reactions in our analysis and considered the whole set of 8119 reactions. PROXIMAL could not improve the accuracy, since it is not able to model multi-centre reactions, obtaining an accuracy of 65.2%. PROXIMAL2, instead, generated the correct product for 7744 reactions (accuracy: 95.4%), which illustrates the clear advance brought by PROXIMAL2 (Case 3 in Fig. 1a). Note here that we could only generate look-up tables for 7941 out of 8119 reactions with PROXIMAL2 (Fig. 1c) and, thus, the accuracy is even higher for these subset of reactions (97.5%). In both cases, the improvement of PROXIMAL2 is statistically significant (two proportions z-test p-value ≤ 2e−16). PROXIMAL2 was not able to generate look-up tables for the remainder 178 reactions mainly due to the incapacity to deal with stereochemical information and the restrictions of atom charge imposed in the definition of MCS (see Methods section).

In the construction of the Maximum Common Substructure (MCS) in PROXIMAL2 we fixed 2 optional parameters related atoms belonging to ring and atom charges (see Step 2 in Methods section). We carried out a sensitivity analysis to evaluate the effect of these heuristic choices in the outcome of PROXIMAL2. In particular, we considered the 4 possible cases, namely when both parameters are fixed, the two cases where only one of the parameters is fixed and none of the parameters are fixed. Supplementary Fig. 2 shows that best performance is obtained in the case that both parameters are considered, which justifies the use of these parameters in PROXIMAL2.

Finally, we assessed the robustness of the metabolic space generated in PROXIMAL and PROXIMAL2 by means of a Leave-One-Out strategy. In particular, for each annotated substrate (S)-product (P) pair, we evaluated whether P is recovered when S is applied as a query compound to any reaction rule except from the one they generate. PROXIMAL2 was able to recover the annotated products in 36% of the whose set of reactions, whereas PROXIMAL only 11%, which again emphasizes the improvement of PROXIMAL2 over PROXIMAL. This relevant difference is observed due to the greater capacity of PROXIMAL2 over PROXIMAL to generate correct reaction rules, as observed in Fig. 1c, which benefits its ability to recover leave out cases.

Application of PROXIMAL2 to predict the degradation of phenolic compounds in the human gut microbiota

Phenolic compounds are potent antioxidants that are derived from foods of plant origin and are mainly metabolised by the human gut microbiota. Despite increasing interest in the health and nutrition literature on phenolic compounds, their metabolism remains largely unknown. Universal metabolic databases, such as KEGG8 or the Model SEED database35, store reactions from species not present in the gut microbiota, and pathway extraction is not direct. For this reason, automatic reconstruction pipelines, such as AGORA10 or CarveMe36, include a limited number of phenolic compounds in curated genome-scale models of the human gut microbiota. Previously, we addressed this problem using a combination of computational methods, manual annotation and expert knowledge, leading to an improved version of AGORA, called AGREDA33. Moreover, we extended AGREDA using RetroPath RL, a popular enzyme promiscuity algorithm37. However, we could not find candidate pathways for 180 out of 372 of phenolic compounds in the Phenol Explorer database (Rothwell et al.30) and, thus, complementary approaches are required. In this sub-section, we continue our previous works by applying PROXIMAL2 to predict degradation pathways of phenolic compounds in the human gut microbiota and comparing the results obtained with RetroPath RL.

RetroPath RL is a rule-based method that makes use of the RetroRules database26 to investigate enzyme promiscuity. RetroRules defines the reaction centre based on an atom-atom mapping between substrates and product atoms and compute reaction rules using the reaction SMARTS (SMILES Arbitrary Target Specification) formalism. The level of specificity of reaction rules in RetruRules can be tuned according to the diameter parameter, i.e. the size of a hypothetic sphere around the reaction centre. Moreover, RetroPath RL allows the user to fix the threshold of chemical similarity between the query compound and the substrate of the rule. To apply RetroPath RL in the most similar conditions to PROXIMAL2, we fixed the diameter at 4 and the chemical similarity threshold at 0.6. Moreover, we defined the same metabolic space for both approaches, namely the one introduced in our previous work37, which involves 5087 reactions taken from AGREDA and the Model Seed Database (Supplementary Table 3).

A high proportion of reactions considered in the metabolic space of AGREDA were present in the RetroRules database; however, we also included some manually curated reactions important for the metabolism of phenolic compounds33,37. PROXIMAL2 only requires the biochemical reaction equation to generate the look-up tables (reaction rules). For the reactions present in RetroRules, we extracted the equations from MetaNetX database, whereas the equations for the manually curated reactions were obtained from AGREDA. Following the complete pipeline of PROXIMAL2, we could generate look-up tables (reaction rules) for 4860 reactions (Fig. 2a). Reaction rules for RetroPath RL, in contrast, were directly obtained from the RetroRules database at diameter 4. For manually curated reactions, we had to create them one by one using the RetroRules webpage (https://retrorules.org/diy). As a result, we generated reaction rules for 5064 reactions with RetroPath RL (Fig. 2a). Overall, PROXIMAL2 shows a higher level of automatisation to generate reaction rules.

Fig. 2: Comparison between PROXIMAL2 and RetroPath RL.
figure 2

a Representation of reaction coverage regarding the ability to generate rules by the algorithms and the coverage of phenolic compounds to produce promiscuous products. The number to the right of the bars represents the absolute number of reactions and compounds; b Venn diagram of the phenolic compounds that can be potentially degraded by PROXIMAL2 and RetroPath RL; c Representation of the different sub-classes of the 53 phenolic compounds captured by PROXIMAL2. The number of compounds belonging to the sub-class is expressed in the legend, e.g. ‘Lignans, 16’; d Frequency of sub-groups of foods associated with the 53 phenolic compounds exclusively predicted by PROXIMAL2 in comparison with37; e Taxonomic classes involved in the predicted reactions for lignans and isoflavonoids. The number in the legend represents the contribution of each class to the predicted reactions; f 3 reactions predicted by PROXIMAL2 in the subset of 35 phenolic compounds for which RetroPath RL did not reach a solution.

We compared the reactions rules obtained with PROXIMAL2 and RetroPath RL, finding that 4837 were present in both cases, whereas 23 and 227 were unique to PROXIMAL2 and RetroPath RL, respectively (Fig. 2a). The main differences between both approaches in the generation of rules are due to several reasons: (i) different treatment of cofactors (29.5%); (ii) ability of RetroPath RL to deal with stereochemistry (25.1%); (iii) reactions no longer present in the MetaNetX database, i.e. impossibility to download the reaction equation from MetaNetX and, consequentially, they were not considered with PROXIMAL2 (16.7%); iv) differences in the definition of MCS and reaction centres, e.g. the mandatory condition in the extraction of MCS in PROXIMAL2 (but not in RetroPath RL) that a match between atoms included in rings can happen only if the atoms are part of rings in both substrate and product (15%). As a result, RetroPath RL appears to cover a slightly wider area of the chemical space than PROXIMAL2.

Then, we applied the generated rules with both approaches to 372 phenolic compounds (Supplementary Table 4) obtained from Phenol-Explorer database30. Since RetroPath RL provides all reaction products that can be generated from a query compound and PROXIMAL2 generates one product at a time, we filtered out the results for which PROXIMAL2 was not able to predict the whole set of putative products. We obtained results for 354 out of 372 phenolic compounds using PROXIMAL2, whereas RetroPath RL generated products for 323 (Fig. 2a). Specifically, 319 of them were in common, while 35 were specific for PROXIMAL2 and 4 for RetroPath RL (Fig. 2b). The differences observed between PROXIMAL2 and RetroPath RL are mainly caused by two factors: i) a different filter of chemical similarity for the query compound, namely PROXIMAL2 uses the Dice coefficient and RetroPath RL the Tanimoto coefficient; ii) differences in the definition of MCS, noted above, which determines a different reaction centre and local neighbourhood.

These differences are even higher if we concentrate on the predicted reactions with these methods. In this case, we found an overlap between PROXIMAL2 and RetroPath RL of only 30%. For this comparison, we considered the same number of predicted products for each phenolic compound in both cases. In particular, we extracted the number of predicted reactions by RetroPath RL for each phenolic compound and fixed this same number of predicted reactions for PROXIMAL2, according to chemical similarity with the annotated products. We assumed that both approaches reached the same product from phenolic compounds when the chemical similarity (Dice coefficient) was equal to 1. This result, in our opinion, emphasizes the complementarity of these approaches.

In our previous work37, we applied RetroPath RL to the same 372 phenolic compounds and extracted putative products for 303 of them using the recommended diameters by the authors (more than 6). Here, PROXIMAL2 could generate putative products for 53 additional phenolic compounds (301 were in common between PROXIMAL2 and RetroPath RL under this scenario). These 53 phenolic compounds were connected to 430 metabolites involved in AGREDA. Interestingly, we found that Lignans and Isoflavonoids were highly represented in this set of metabolites, namely 16 and 9 metabolites, respectively (Fig. 2c). In addition, according to the food composition provided by Phenol-Explorer, the 53 phenolic compounds are part of 8 sub-groups of food, with Soy and soy products and Fruits – Berries the most annotated (Fig. 2d). Moreover, we analysed the taxonomies of predicted reactions for the degradation of this subset of 53 phenolic compounds. Figure 2e shows the contribution of different taxonomic classes to the reactions involved in the degradation of lignans and isoflavonoids (see Supplementary Table 5 for details). It can be observed that the most relevant classes are Bacilli, Clostridia, Bacteroidia, Gammaproteobacteria and Actinobacteria, in line with other phenolic compounds previously annotated in AGREDA. The same analysis at the species level can be found in Supplementary Table 6.

Overall, these results show that PROXIMAL2 complements our previous analysis with RetroPath RL and opens new research directions to understand the metabolism of phenolic compounds in the human gut microbiota. A complete biological validation of the reactions and metabolites predicted by PROXIMAL2 requires further analysis and data. However, for illustration of the relevance of the results obtained, we present below some specific cases for which PROXIMAL2 (and not RetroPath RL) leads to biologically meaningful hypotheses.

First, we found different glucuronide species where PROXIMAL2 predicts the loss of glucuronic acid. One example can be observed in Fig. 2f, where the degradation of phloretin_2’-O-glucuronide into phloretin and glucuronic acid is shown. This metabolite is present in apples and derived foods, such as apple juices38. The output product, phloretin, has been shown to exert anti-inflammatory activity in different diseases via the gut microbiota, e.g. ulcerative colitis39. The template glucuronidase enzyme, from which this reaction is predicted, can be found in 774 metabolic models in AGREDA, particularly among the following classes: Bacilli, Clostridia and Bacteroides.

Another relevant phenolic compounds for which PROXIMAL2 predicts a degradation reaction is 4-hydroxyhippuric acid, which is converted into glycine and 4-hydroxybenzoic acid (Fig. 2f). Our predicted reaction has been proposed elsewhere in the literature but in the opposite direction40. This could be a plausible hypothesis because 4-hydroxybenzoic acid is part of a wider set of foods and 4-hydroxyhippuric acid is more often found in urine samples, according to PhenolExplorer. Interestingly, in contrast with other phenolic compounds, 4-hydroxyhippuric acid seems to be pro-inflammatory by increasing the level of cytokine TNF-α41. The template enzyme from which this reaction is predicted takes part in 51 metabolic models in AGREDA, being Gammaproteobacteria, Alphaproteobacteria and Betaproteobacteria the main relevant classes.

Finally, PROXIMAL2 predicted the degradation of the lignan cyclolariciresinol, also called isolariciresinol. This phenolic compound is found in different fruits, including blackberry, blackcurrant or strawberries, and sesame seeds, according to PhenolExplorer; however, it has been abundantly identified in wines42 and soy-based supplements43, among others. Previous works in the literature have hypothesised the antioxidant activity of isolariciresinol44. This is consistent with findings that other lignan metabolites, such as secolariciresinol, have a protective role against a variety of diseases45. With respect to the metabolism of isolariciresinol, there is limited evidence in the literature. Here, we propose the transformation of isolariciresinol into scaphopetalone (Fig. 2f). This metabolite has been identified in different plants46 but not in human samples so far. However, the activity of scaphopetalone derivatives against Herpes simplex and Herpes zostera has been recently demonstrated47. The template enzyme from which this reaction is predicted takes part in all metabolic models in AGREDA.

Discussion

Cellular metabolism involves the totality of chemical transformations that can occur in organisms and, though our knowledge is continuously growing, many metabolic pathways are still incomplete. A relevant case in the field of nutrition is the metabolism of phenolic compounds in the human gut microbiota, which remains largely unknown. Phenolic compounds, which are mainly derived from foods of plant origin48,49, are converted into bioactive metabolites that appear to limit the risk of several major diseases, such as coronary heart disease50, cancer51 or diabetes52. This fact has stimulated research to complete the knowledge about the degradation pathways of these nutrients in the human gut microbiota.

Recently, several methodologies have been developed to fill in metabolic gaps. The computational analysis of the metabolic space through enzyme promiscuity has received much attention. Specifically, rule-based methods have grown in number and quality in the last years21. In a previous work37, we applied a well-known rule-based enzyme promiscuity algorithm, RetroPath RL, to predict the degradation pathways of 372 phenolic compounds from Phenol-Explorer. Here, we explore a different rule-based algorithm, PROXIMAL, to address the same question. Given the current limitations and KEGG dependencies of PROXIMAL, we developed PROXIMAL2, which can automatically generate rules for a wider spectrum of reactions and make more reliable and comprehensive predictions for the degradation of phenolic compounds in the human gut microbiota.

As detailed in Methods section, without KEGG dependencies, PROXIMAL2 automatically extracts reactions rules without relying on the KEGG database, and replicates the look-up tables defined in PROXIMAL for predicting novel reactions. Further, PROXIMAL2 includes new features that were not part of PROXIMAL and expands its scope of application. In particular, PROXIMAL2 is able to capture complex transformations that involves multi step reactions through the development of multi-centres look-up tables. In addition, a detection of possible atom charges was implemented in PROXIMAL2, which allows tracking the charges present in substrates and products and adding or removing charges depending on the transformation. These new features of PROXIMAL2 significantly improved the performance of PROXIMAL, as described in the Results section with the comparison with KEGG reactions. PROXIMAL2 increased the coverage of KEGG reactions that can be potentially used for predicting enzyme promiscuity, namely 1950 KEGG reactions can be considered with PROXIMAL2 but not with PROXIMAL. Second, PROXIMAL2 shows higher accuracy than PROXIMAL in recovering annotated products from the substrates in KEGG reactions in the different scenarios considered in Fig. 1a.

PROXIMAL2 was applied to predict novel degradation routes of phenolic compounds in the human gut microbiota. Note here that this study had not been feasible with PROXIMAL, given its KEGG dependences and the complexity of annotated reactions in AGREDA for phenolic compounds. We compared the results of PROXIMAL2 with RetroPath RL. With respect to rule generation, we found that PROXIMAL2 is more practical in the automatic extraction of reaction rules, as illustrated with the subset of manually curated reactions included in the metabolic space. On the other hand, RetroPath RL covers a wider range of chemical space, obtaining reaction rules for 227 reactions for which PROXIMAL2 could not obtain look-up tables. Although the difference is limited, we expect to address these limitations in future developments.

In addition, we found that PROXIMAL2 and RetroPath RL have a significant overlap in the subset of phenolic compounds for which they could find a putative product. Both algorithms obtained a promiscuous product in 319 out of 372 compounds (85.7%). However, we found PROXIMAL2 and RetroPath RL complementary; PROXIMAL2 predicted new output products for 35 phenolic compounds that were not captured by RetroPath RL, which represents an 9% increase in identification of phenolic compounds. We found that these differences are dependent on the choice of specific parameters, e.g. diameter in RetroPath RL or chemical similarity coefficient used in both approaches, which emphasizes the importance of investigating the optimal set of parameters for different algorithms. Importantly, our results open new research directions in the metabolism of phenolic compounds in the human gut microbiota. We expect to further analyse and experimentally validate the results obtained from this study.

Finally, beyond the problem of phenolic compound degradation in the human gut microbiota, PROXIMAL2 is a general-purpose algorithm and can be applied to predict novel associations of other molecules of interest to putative products. In fact, the previous version of PROXIMAL2, PROXIMAL, has been previously applied to a variety of biological questions, including the prediction of xenobiotic metabolism27 to create extended metabolic model of E. coli7, and to suggest biological molecular candidates when annotating metabolomics data31. As PROXIMAL2 has the same features than PROXIMAL, it can be complementarily used with other existing rule-based methods, such as RetroPath RL, to provide novel insights into metabolic gaps for different applications.

Methods

PROXIMAL27 is a rule-based method for the prediction of metabolic products. PROXIMAL defines look-up tables for a database of substrate-product pairs. These look-up tables link specific molecular substructures with their associated chemical transformations (Fig. 3a). In particular, ‘keys’ in the look-up tables specify the modified substrate structure, including: i) the reaction centre, atom where the chemical transformation occurs; ii) adjacent neighbours, atoms connected to the reaction centre at distance 1; iii) distant neighbours, atoms connected to the reaction centre at distance 2. In other words, the modified substrate structure induces a subgraph of neighbours within radius 2 starting from the reaction centre. In addition, ‘values’ in the look-up tables describe the modifications resulting in the product, including: i) reaction centre, ii) adjacent neighbours and iii) added/removed functional group, which defines the modified part of the substrate. Note here that the atoms in the look-up tables follow the nomenclature of ‘KEGG atom types’, which are defined according to their functional groups and microenvironment, e.g. C8x or C8y.

Fig. 3: Schematic representation of the PROXIMAL workflow.
figure 3

a Definition of an example look-up table in PROXIMAL. After the pair selection, the substrate and product are represented in KEGG atom type and then aligned to permit the definition of the reaction centre and its neighbours up to distance 2. The starting reaction is (R01296): 4-Hydroxybenzoate+O2+NADH+H(+) -> 3,4-Dihydroxybenzoate+H2O+NAD(+). b Application of the look-up table to an example selected query. The latter is represented in KEGG atom type and the key of the look-up table is searched within the molecule. Finally, the product is generated based on the look-up table value. The red, blue and green circles represent the reaction centre, adjacent and distant neighbour, respectively. The atoms O and N are marked in red and blue, respectively, as conventionally used in chemistry by the CPK colouring rules.

Once these look-up tables are defined for all substrate-product pairs involved in the database of reactions of interest, PROXIMAL lists the different subgraphs of neighbours within radius 2 for a given query compound, searches for the ones matching with the sub-structures stored in the key tables and applies their associated transformation defined in the value tables in order to generate putative products. Figure 3b illustrates that the subgraph of neighbours centred at atom number 8 in the query compound L-Tyrosine matches with the key table defined in Fig. 1a for the pair 4-Hydroxybenzoate and 3,4-Dihydroxybenzoate, leading to the compound 3,4-Dihydroxy-L-phenylalanine.

The identification of the reaction centre for each substrate-product pair is the critical step in the definition of look-up tables. This is done following different steps that require i) the alignment of the substrate and product atoms and ii) the classification of different atoms into different ‘KEGG atom types’ (Fig. 3a). Reaction centres are defined as any specific substrate atom that aligns with a product atom of different KEGG atom type. In Fig. 1a, the reaction centre corresponds with the atom number 8 of 4-Hydroxybenzoate and the atom number 8 of 3,4-Dihydroxybenzoate, whose KEGG atom types are C8x and C8y, respectively.

The look-up tables in PROXIMAL were built using the RPAIRS database, where substrate-product pairs are defined for each reaction, atomic species are classified into KEGG atom types, and substrate atoms are aligned to product atoms. Unfortunately, as noted above, RPAIRS database was discontinued in 2016, which restricts its application to a limited subset of reactions and, thus, metabolic space (i.e. set of potential biochemical transformations and molecules17). Therefore, we present here a completely new pipeline, called PROXIMAL2, summarised in Fig. 4a.

Fig. 4: PROXIMAL2 workflow and illustration of multi-centre reactions.
figure 4

a PROXIMAL2 pipeline summarising main changes with respect to the previous version. b Example of multi-centre reactions. The alignment provides two changing KEGG atom types (atom numbers 3 and 4), leading to two reaction centre that are labelled as R1 and R2. In the look-up table, the information about the reaction centres and respective neighbour is stored. The reaction is (R01091): L-Leucine -> (3R)-beta-Leucine.

Step 1: Definition of the database of reactions, metabolites and structural information

First, we created the metabolic space that is required to apply our enzyme promiscuity algorithm. Starting from an input database of reactions, we first removed the most common cofactors involved within the transformations and deleted the reactions if they only included cofactors. Then, the metabolites involved in the different transformations were extracted. We obtained SMILES and InChI for each metabolite from different public databases: PubChem53, KEGG8, HMDB54, MetaNetX55 and RetroRules26. Metabolites with no available structure were filtered out from the reactions; similarly, input reactions only involving metabolites without structure were deleted. As a result, we obtained a list of simplified reactions.

Step 2: Definition of substrate-product pairs

As noted above, PROXIMAL works with substrate-product pairs. However, the output (simplified) reactions from Step1 can include more than one substrate or product. To identify the best matching pairs for a given reaction, we calculated the chemical similarity between each substrate-product pair and paired them according to such value. The chemical similarity was determined using the RDKit package34 and the Morgan Fingerprint56. In reactions with a single substrate (or product), the latter is associated to each product (substrate) to form more than one pair. The output of this step is the list of substrate-product pairs for the generation of look-up tables. Note that with this chemoinformatic approach we recovered 97.5% of pairs annotated in RPAIRS.

Step 3: Alignment of substrate-product pairs

As noted above, the identification of the reaction centre and look-up tables for each substrate-product pair requires their atomic alignment. To that end, we extracted the maximum common substructure (MCS) between the substrate and product with the function findMCS available in the module rdFMCS in RDKit package. We fixed two optional parameters in this RDKit function in order to consider the specificity of the atoms belonging to rings and charged atoms. In particular, a match between atoms included in rings can happen only if the atoms are part of rings in both the substrate and product. Similarly, two atoms can match with each other only when both have the same charge (Supplementary Note 1, Supplementary Fig. 3).

Step 4: Definition of reaction centres and look-up tables

As noted above, reaction centres are defined as any specific substrate atom that aligns with a product atom of different KEGG atom type. Here, the classification of substrate and product atoms into different KEGG atom types were done with the package KCF-Convoy57,58. With this information and the atomic alignment (Step3), reaction centres were identified.

A relevant improvement included in PROXIMAL2 is the possibility to analyse reactions containing multiple reaction centres. In some cases, it may happen that a reaction represents a multi-step reaction59, where the intermediate steps are removed for any reason (e.g. inability to measure the intermediate), and they are unified in the transformation that connects the initial substrates directly to the final products. This concept translates into having more than one reaction centre when the changing bonds are analysed. For each reaction centre involved in these transformations, look-up tables were defined, as shown in Fig. 4b.

In addition, in some cases, we found a modification of the bonds between the atoms within the MCS, showing a new arrangement due to the transformation. Specifically, during the rearrangement of the structure, a bond can be introduced, deleted or simply changed (e.g. going from a double to single bond). These modifications were also extracted together and included in the look-up tables as multiple reaction centres (see an illustration in Supplementary Fig. 4). Finally, the information about the possible position of a charged atom is extracted in order to apply that charge when the transformation is applied.

Step 5: Search of matching keys within the query compound

Once the look-up tables representing the chemical transformations are generated, we define the subgraph of neighbours within radius 2 for each atom of the query compound and search for matching keys in the look-up tables. For multi-centre reactions we ensure that the query compound matches with all their associated keys. Note here that a pre-filter was implemented to any operator applied to the query compound. In particular, we discard matches where the chemical similarity between the query compound and the substrate of the matching entry was below 0.6.

Step 6: Product generation

Once an operator in the look-up table matches the query compound, we generated the promiscuous product. Therefore, considering the molecule of interest, the transformation is applied to each reaction centre, adding atoms and bonds coherently according to the original template reaction. When there is more than one reaction centre, it can happen that a substructure, which must be added to the query, is in common between several reaction centres. To avoid the addition of that substructure multiple times, we implemented a tracking system of the atoms and bonds already introduced, rejecting duplicate additions. Then, bonds are added, removed or changed coherently to the operator definition. Finally, information about the charges present in the query is kept and introduced in the generated product molecule. If a charged atom was removed due to the transformation, the charge will not be present in the final product. Similarly, when a functional group introduced in the structure contained a charged atom, that charge was introduced in the predicted molecule.

Once all the changes within the chemical transformation are applied, the generated products are saved as json file, where the information about the final product (Smiles ID and mol block text), the reaction template (ID, EC number and Formula), and the initial molecule (name, ID and structure) is stored. Although the generation of the mol block text was always possible because it was generated manually, the SMILES string was generated using the RDKit package.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.