MICheck (Microbial Genome checker) is a web tool that enables rapid verification of sets of annotated genes and frameshifts (syntactic annotation) in complete bacterial genomes that are available in public databanks. The strength of the strategy rests on the use of several gene models that we have characterized in a large set of bacterial genomes and that make it possible to take into account compositional heterogeneities of the DNA sequences. Given a raw DNA sequence and its set of annotations (in GenBank or EMBL format), MICheck first runs the AMIGene and ProFED methods, either with suitable gene models or with a new gene model computed from the user's input annotations. The set of newly predicted CDSs (AMIGene CDSs) is then compared to the user annotations (user CDSs), and the CDSs found only by AMIGene are submitted to Blast comparisons against the SWALL databank. The web interface allows one to investigate the MICheck results using our graphical representation, in which the genomic context of a unique CDS annotation, or of a predicted frameshift, is drawn using information on the coding potential (curves) and the annotation of the neighbouring genes. In the context of the numerous re-annotation projects of microbial genomes, this tool can be seen as a preliminary step before functional re-annotation, to quickly check for missing or wrongly annotated genes.
Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from annotated DNA sequences or from a large enough set of anonymous DNA sequences.
These two choices depend on the DNA sequence you are going to analyze :
1) if the corresponding organism is very close to one of the listed species (i.e., a strain of that species), you can select the corresponding species in order to use the gene models we have built, as explained below.
2) if your genome is really new, it is recommended to build an appropriate gene model that will better fit the input DNA sequence.
"Build a Gene Model"
The input sequence should be at least 10 kb long. The prokov-learn method (Alain Viari, personal communication) will run with the highest Markov model order possible (2, 3, or 4).
The minimum size of the Open Reading Frames (ORFs) extracted from the input sequence depends on its GC content. For our three reference genomes, the minimum ORF sizes were the following :
- B. subtilis (GC content of 43%) : 500 bp
- E. coli (GC content of 51%) : 700 bp
- M. tuberculosis (GC content of 66%) : 900 bp
We first compute the GC% of the input sequence and, as a first approximation, use a linear regression (obtained from the values of the three reference genomes) to determine the minimum size of the ORFs to be extracted.
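The regression step above can be sketched as follows. This is only an illustration, assuming an ordinary least-squares fit through the three reference points quoted above; the function names and the rounding to the nearest base pair are hypothetical and are not taken from the MICheck code.

```python
# Reference points quoted in the text: (GC %, minimum ORF size in bp).
REFERENCE_POINTS = [(43.0, 500.0), (51.0, 700.0), (66.0, 900.0)]


def gc_percent(sequence):
    """Return the GC content of a DNA sequence as a percentage of A/C/G/T."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


SLOPE, INTERCEPT = fit_line(REFERENCE_POINTS)


def min_orf_size(sequence):
    """Estimated minimum ORF size (bp); rounding is an assumption."""
    return round(SLOPE * gc_percent(sequence) + INTERCEPT)
```

With these three points the fitted line rises by roughly 17 bp of minimum ORF size per additional percent of GC.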
For the Markov model of noncoding sequence, we use a zero-order model with four probability parameters estimated from the genome-specific frequencies of mononucleotides.
=> the corresponding gene model is called pre-matrix and is used by AMIGene to predict CDSs.
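As an illustration of the zero-order noncoding model mentioned above, the four parameters are simply the frequencies of A, C, G and T. The sketch below is not the prokov-learn implementation; the function names are hypothetical, and skipping characters other than A, C, G and T when counting is an assumption.

```python
import math
from collections import Counter


def mononucleotide_frequencies(sequence):
    """Estimate the four zero-order parameters from a DNA sequence."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def log_probability(sequence, freqs):
    """Log-probability of a sequence when each base is drawn independently."""
    total = 0.0
    for base in sequence.upper():
        p = freqs.get(base, 0.0)
        if p == 0.0:
            # A base never seen in training has probability zero.
            return float("-inf")
        total += math.log(p)
    return total
```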
"Use specific gene models for an organism"
The AGC (Atelier de Génomique Comparative) group systematically investigates codon usage differences in genomes, to identify the major factors influencing synonymous codon frequencies within a set of CDSs. The following procedure is performed for the complete genomes (public or non-public) we are interested in :
The multivariate statistical technique of Factorial Correspondence Analysis (FCA) is used to identify major trends acting on codon usage within the annotated or predicted CDSs.
The K-Means algorithm is also used on absolute and/or relative codon frequencies of the coding sequences, k being equal to 2, 3 or 4 depending on the contribution of each major factor to codon usage bias.
The gene classes define the training sets for protein-coding regions, the rest of the sequence being included in the noncoding training set. The corresponding gene models (2, 3 or 4 in total) are generated by the prokov-learn program and subsequently used in the core of the AMIGene method.
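The codon usage vectors on which FCA and K-Means operate can be sketched as below. This only illustrates the input data, not the AGC pipeline itself: the function names are hypothetical, and "relative" is taken here to mean counts normalized over all 64 codons of a CDS, which is one possible reading (normalizing within each amino acid would be another).

```python
from collections import Counter
from itertools import product

# The 64 codons in a fixed order, so every CDS maps to a comparable vector.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_counts(cds):
    """Absolute codon counts of a CDS read in frame from its first base."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    # Codons containing ambiguous bases (e.g. N) are not in CODONS and are dropped.
    return [counts[codon] for codon in CODONS]


def codon_frequencies(cds):
    """Codon counts normalized to sum to 1 over the 64 codons."""
    counts = codon_counts(cds)
    total = sum(counts)
    if total == 0:
        raise ValueError("CDS contains no complete unambiguous codon")
    return [count / total for count in counts]
```

One such vector per CDS gives a table (CDSs in rows, codons in columns) that a clustering method with k = 2, 3 or 4 could take as input.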
A list of the complete bacterial genomes analyzed is given on the current database status web page.
The AMIGene method takes as input the DNA sequence and the gene model(s) in order to predict putative CDSs. The parameter values are set to the optimized parameters obtained with GC-poor, GC-medium and GC-rich genomes (B. subtilis, E. coli and M. tuberculosis, respectively).
For further details, see the AMIGene method here.
The GC content can also be automatically computed.
This section of the MICheck form allows the user to upload a file containing the sequence and the gene annotations (in EMBL or GenBank format). The MICheck strategy can be performed either on a complete genome or on a part of it (at least 10 kb are required to build a new gene model).
A label is required to name the predicted CDSs. If this label is, for example, PL (for Photorhabdus luminescens), the CDSs listed in the output page will be called PL0001, PL0002, etc.
When the process is complete, a URL for viewing the results is sent to the e-mail address specified by the user. The link remains available for one week and allows the user to consult the MICheck predictions through a graphical interface.
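The naming scheme in the example above can be reproduced with a few lines. This is a minimal sketch: the function name is hypothetical, and the four-digit zero padding is inferred only from the PL0001 example.

```python
def cds_labels(prefix, count, width=4):
    """Return labels such as PL0001, PL0002, ... for `count` predicted CDSs."""
    return [f"{prefix}{number:0{width}d}" for number in range(1, count + 1)]
```

For instance, `cds_labels("PL", 2)` gives the two labels quoted in the example.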
The MICheck results summary is divided into four tables.
The first one shows a statistical overview of the execution.
The second table shows a report of the unique CDSs found by the AMIGene strategy.
In this context, a CDS is described by its label, start position, end position, length in base pairs and in amino acid residues, frame, and the results of the Blast searches in SWISSPROT.
The third table summarizes the unique annotated CDSs (labelled Unique User). It reports the same information as the table above, except for the Blast results.
In the last one, the user can find the ProFED results for the whole sequence.
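The split between "unique predicted" and "unique user" CDSs reported in these tables amounts to comparing two sets of CDSs. The sketch below is only an assumption about how such a comparison could be made: two CDSs are treated as identical when they share the same start, end and strand, which may differ from the criterion MICheck actually applies, and the names `CDS` and `split_unique` are hypothetical.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    start: int
    end: int
    strand: str  # "+" or "-"


def split_unique(predicted, annotated):
    """Return (CDSs only in `predicted`, CDSs only in `annotated`)."""
    predicted_set, annotated_set = set(predicted), set(annotated)
    unique_predicted = sorted(predicted_set - annotated_set)
    unique_annotated = sorted(annotated_set - predicted_set)
    return unique_predicted, unique_annotated
```

A looser criterion (for example, same strand and same stop position) could be obtained by comparing derived keys instead of the full tuples.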
The graphical results of MICheck include the list of annotated and predicted CDSs in text format, and a graph showing both the position of the CDSs and the coding probability curves in the 6 reading frames of the input sequence. The graph is fully dynamic and allows the user to navigate along the sequence ; the corresponding list of predicted CDSs is updated accordingly.
In addition, MICheck creates several output files that can be subsequently downloaded :
1) one file containing the list of unique predicted CDSs
2) one fasta file containing the nucleic sequences of the unique predicted CDSs
3) one fasta file containing the corresponding protein sequences.
4) one file containing the position of putative frameshift errors which have been found by the ProFED method described below.
ProFED : Procaryotic Frameshift Errors Detection
The ProFED method detects putative frameshifts in DNA sequences. To run, it needs a set of annotated genes (original databank annotations and AMIGene CDS predictions), one or several gene models, and a DNA sequence, namely the one contained in the annotation file.
Algorithm
A sliding window of length W is used to compute the mean of the coding probability in the corresponding region (i.e., the highest probability obtained with one of the input gene models). If this value satisfies a preset threshold, Tpcod, several cases are analyzed :
1) A CDS has been annotated at this position of the DNA sequence (in the same frame i). ProFED looks for another frame j, on the same strand, containing an annotated CDS at the same position. If the two corresponding CDSs overlap over a length greater than a threshold, Tover, the detected putative frameshift corresponds to partially overlapping CDSs (status 'Overlapping'). If the shorter CDS is included in the longer one, an inclusion percentage is computed and compared to the threshold Tincl. This identifies frameshifts with the 'Compensated' status.
2) No CDS has been annotated at this position of the DNA sequence, in any of the six reading frames. ProFED computes the number of sliding windows with a satisfactory coding prediction curve with respect to the beginning of the next CDS, or with respect to the decrease of the coding prediction curve. If this number is greater than a threshold, Tlopc, the detected putative frameshift is labelled with the 'Onlypcod' status.
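The building blocks of this procedure can be sketched as follows. This is a simplified illustration, not the ProFED code: the function names are hypothetical, the window is assumed to move one position at a time, "satisfies the threshold" is read as "greater than or equal to Tpcod", and the run count is a stand-in for the window count compared to Tlopc. The inclusion percentage used for the 'Compensated' status is left out because the text does not say how it is computed.

```python
def window_means(pcod, w):
    """Mean coding probability of every full window of length w (step of 1)."""
    if w <= 0 or w > len(pcod):
        return []
    total = sum(pcod[:w])
    means = [total / w]
    for i in range(w, len(pcod)):
        total += pcod[i] - pcod[i - w]  # slide the window by one position
        means.append(total / w)
    return means


def longest_coding_run(means, t_pcod):
    """Length of the longest run of consecutive windows with mean >= t_pcod."""
    best = current = 0
    for mean in means:
        current = current + 1 if mean >= t_pcod else 0
        best = max(best, current)
    return best


def overlap_length(cds_a, cds_b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(cds_a[1], cds_b[1]) - max(cds_a[0], cds_b[0]) + 1)
```

Under these assumptions, a pair of CDSs in two frames of the same strand would be an 'Overlapping' candidate when `overlap_length` exceeds Tover, and an unannotated region would be an 'Onlypcod' candidate when `longest_coding_run` exceeds Tlopc.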
Default parameters
The default parameters used in the ProFED method were first determined empirically, based on several runs performed on various bacterial genomes. These parameters are summarized in the table below:
Level of confidence
Additionally, a level of confidence is given for each predicted frameshift. This level depends exclusively on the length of sequence involved in the frameshift. Again, several cases can be considered.
Case of an 'Overlapping' frameshift.