KEGG is an integrated database resource, broadly categorized into systems information, genomic information, and chemical information (http://www.genome.jp/kegg/). The tool implemented in this interface gives an overview of the KEGG metabolic pathways that contain genes of a query genome (from our Prokaryotic Genome DataBase, PkGDB) annotated with enzymatic functions. Note that the coloration of the KEGG maps is based only on the EC number annotations. Because KEGG maps are 'mosaics' (generally large metabolic maps that combine the pathway variants found in many genomes for the production or degradation of a compound), this tool should be used in parallel with the results obtained with Pathway Tools (MicroCyc functionality).
MicroCyc is a collection of microbial Pathway/Genome Databases (PGDBs) created in the context of the MicroScope projects. They are supported by the Pathway Tools software developed by Peter Karp and his team at SRI International. These PGDBs were generated using the PathoLogic module, which computes an initial set of pathways by comparing a genome's annotations to the metabolic reference database MetaCyc.
For each studied genome, the annotation data is extracted from our Prokaryotic Genome DataBase (PkGDB), which benefits from the (re)annotation process performed in our group (AGC), the enzymatic function predictions computed with the PRIAM software, and the expert functional annotation work carried out by a diverse community of biologists using the MaGe system. These automatically generated PGDBs (Tier 3) are updated every day.
Starting with the set of KEGG or MicroCyc metabolic pathways predicted for each genome integrated in the PkGDB database, this method compares the metabolic content of the selected bacterial genomes. The comparison is based on the computation of a 'pathway completion' value, i.e., the number of reactions of pathway x found in a given organism divided by the total number of reactions in the same pathway x as defined in the MetaCyc or KEGG databases. To give an overview of how the compared genomes cluster according to their metabolic capabilities, the MeV tool (http://www.tm4.org/mev/) has been integrated into this functionality: it takes the comparison results as input and performs a hierarchical clustering using the pathway completion values.
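The pathway completion value described above can be illustrated with a minimal Python sketch. The pathway identifiers, reaction identifiers, genome names and the function name `pathway_completion` are hypothetical and chosen only for illustration; this is not the code used by the platform.

```python
def pathway_completion(organism_reactions, reference_pathways):
    """Return {pathway_id: completion value} for one organism.

    completion = reactions of the pathway found in the organism
                 / total reactions of the pathway in the reference database
    """
    completion = {}
    for pathway_id, reference_reactions in reference_pathways.items():
        if not reference_reactions:
            continue  # skip empty pathway definitions to avoid dividing by zero
        found = organism_reactions & reference_reactions
        completion[pathway_id] = len(found) / len(reference_reactions)
    return completion


# Hypothetical reference pathways (stand-ins for MetaCyc or KEGG definitions).
reference_pathways = {
    "PWY-A": {"R1", "R2", "R3", "R4"},
    "PWY-B": {"R5", "R6"},
}

# Hypothetical sets of reactions predicted in each genome.
genome_reactions = {
    "genome_1": {"R1", "R2", "R5", "R6"},
    "genome_2": {"R1", "R2", "R3", "R4"},
}

# One row of completion values per genome.
completion_matrix = {
    genome: pathway_completion(reactions, reference_pathways)
    for genome, reactions in genome_reactions.items()
}

for genome, row in completion_matrix.items():
    print(genome, row)
```

Each row of `completion_matrix` is the kind of per-genome vector of completion values that a hierarchical clustering tool can take as input.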
This tool combines, for one query genome, two different kinds of neighbourhood (genomic and metabolic) in order to give clues for the functional annotation of proteins of unknown function (hypothetical proteins). It searches for the genomic regions containing genes that are both involved in synteny groups with the compared bacterial genomes (from our Prokaryotic Genome DataBase, PkGDB) and involved in metabolic pathways (either KEGG or MetaCyc hierarchy).
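As a rough illustration of combining the two neighbourhoods, the Python sketch below keeps runs of consecutive syntenic genes that contain at least one gene assigned to a pathway. The `Gene` class, its fields, the function `candidate_regions` and the rule used to delimit a region are all assumptions made for this example; the actual tool may define regions differently.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    label: str
    in_synteny: bool                  # gene belongs to a synteny group (assumed flag)
    pathways: frozenset = frozenset() # pathways the gene is assigned to, if any


def candidate_regions(genes):
    """Split the ordered gene list into runs of consecutive syntenic genes,
    then keep only the runs in which at least one gene has a pathway."""
    regions, current = [], []
    for gene in genes:
        if gene.in_synteny:
            current.append(gene)
        else:
            if current:
                regions.append(current)
            current = []
    if current:
        regions.append(current)
    return [region for region in regions if any(g.pathways for g in region)]


# Hypothetical genes listed in chromosomal order.
genes = [
    Gene("g1", True, frozenset({"PWY-A"})),
    Gene("g2", True),                        # hypothetical protein, no pathway
    Gene("g3", False),
    Gene("g4", True),                        # syntenic, but no metabolic context
    Gene("g5", False, frozenset({"PWY-B"})), # pathway, but not syntenic
]

regions = candidate_regions(genes)
region_labels = [[g.label for g in region] for region in regions]
print(region_labels)
```

In this toy example, `g2` has no assigned function but sits in a conserved region next to a gene of `PWY-A`, which is the kind of clue the tool is meant to surface.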
Starting from the list of MicroCyc pathways predicted by the Pathway Tools software, users can curate the predictions for a given organism by assigning a status to each pathway.
The possible pathway statuses are:
- predicted: predicted by the BioCyc PathoLogic algorithm (the default status)
- validated: curated as a functional pathway (all the reactions of the pathway are supposed to exist in the organism)
- variant needed: the predicted pathway is not completely correct for the organism (i.e. some reactions may not be present in the organism but no better pathway definition exists in MetaCyc). Thus, a new pathway variant definition is needed.
- unknown: not enough evidence to declare the pathway as functional (i.e. validated status)
- non functional: the pathway has been lost in the organism and is no longer functional (e.g. due to gene loss or pseudogenisation events)
- deleted: curated as a false positive prediction
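For readers who handle such curation data in scripts, the statuses listed above could be represented as follows. This is only a sketch: the names `PathwayStatus`, `curate` and `status_of`, and the pathway identifiers, are hypothetical and do not correspond to an interface of the platform.

```python
from enum import Enum


class PathwayStatus(Enum):
    PREDICTED = "predicted"          # default status
    VALIDATED = "validated"
    VARIANT_NEEDED = "variant needed"
    UNKNOWN = "unknown"
    NON_FUNCTIONAL = "non functional"
    DELETED = "deleted"


def curate(curations, pathway_id, status):
    """Record a curated status; raises ValueError for an unknown status label."""
    curations[pathway_id] = PathwayStatus(status)


def status_of(curations, pathway_id):
    """Return the curated status, or 'predicted' if the pathway was never curated."""
    return curations.get(pathway_id, PathwayStatus.PREDICTED)


curations = {}
curate(curations, "PWY-A", "validated")
curate(curations, "PWY-B", "variant needed")

print(status_of(curations, "PWY-A").value)
print(status_of(curations, "PWY-C").value)
```

Falling back to `PREDICTED` for pathways that were never curated mirrors the fact that "predicted" is the default status.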
CanOE was designed with the specific objective of finding candidate genes for sequence-orphan enzymatic activities. Its results can be exploited in several ways:
- Genomic metabolons can serve as a visual summary representation of the metabolic contexts corresponding to groups of co-localised genes. This simplifies the reconstruction effort required to assess this context when annotating a gene of interest. As such, metabolons can be useful aids to annotation.
- Potential gene-reaction associations are automatically generated. They can be visualised and considered when attempting to annotate the concerned genes.
- The integration across multiple genomes allows CanOE to propose the transfer of functional annotations on the basis of sequence similarity backed up by metabolic context. The inferred associations can also be visualised and considered when attempting to annotate the concerned genes.
- CanOE is able to propose candidate genes even for sequence-orphan reactions. Furthermore, gene families can receive family-based association scores and ranks for these reactions. Candidate genes for orphan enzymatic activities can thus be ranked in each organism.