|
Gene Model
Computer methods of accurate gene finding in DNA sequences require models
of protein coding and non-coding regions derived either from annotated DNA
sequences or from a large enough set of anonymous DNA sequences.
These two choices depend on the
DNA sequence you are going to analyze :
1) if the corresponding organism is very closed to one
of the listed species (i.e, strains), you can select the corresponding species
in order to use the gene models we have built up, as explained below.
2) if your genome is really new, it is recommended to
build an appropriate gene model that will better bit with the input DNA sequence.
"Build a Gene Model"
The minium length of the input sequence should be at least 10 kb.
The prokov-learn method
(Alain Viari, personnal
communication) will run with the highest Markov order model possible (2, 3, or 4).

The minimum size of the Open
Reading Frame (ORF) being extracted from the input sequence depends on
its GC content. For our three reference genomes, the minimum ORF size
were the following :
-
B. subtilis (GC content of 43%)
: 500 bp
-
E. coli (GC content of 51%)
: 700 bp
- M. tuberculosis (GC content of 66%) :
900 bp
We then first compute the GC% of the input sequence, and as a first approximation,
we used a linear regression (obtained with the values of the three genome models) to determine
the minimum size of the ORFs being extracted.
For a Markov model of a noncoding sequence, we used the zero order model
with four probability parameters estimated by genome specific frequencies
of mononucleotides.
=> the corresponding gene
model is called pre-matrix and is used by AMIGene to predict CDSs.
"Use specific gene models for an organism"
The
AGC (Atelier de Génomique Comparative) group systematically investigates
codon usage differences in genomes, to identify major factors influencing synonymous codons frequencies
within a set of CDSs. The following procedure is performed for complete genomes (public or
non-public) we are interested in :
The multivariate statistical technique of Factorial Correspondence Analysis
(FCA) is used to identify major trends acting on codon usage within the annotated or
predicted CDSs.
The K-Means algorithm is also used on absolute and/or relative codon frequencies
of the coding sequences, k being equal to 2,3 or 4 depending on the contribution of each major factor
to codon usage bias.
The gene classes defined the training sets for protein-coding regions, the
rest of the sequence being included into the noncoding training set. Corresponding
gene models (2,3 or 4 in total) are generated by the prokov-learn program
and subsequently used in the core of the AMIGene method.
To date, this analysis has been done for the following bacterial genomes
:
- Escherichia coli K12
- Bacillus subtilis 168
- Mycobacterium tuberculosis
H37Rv
- Mycobacterium tuberculosis
CDC1551
- Helicobacter pylori 26695
- Yersinia pseudotuberculosis
IP32953 (unpublished)
- Bacillus halodurans
- Escherichia coli O157H7-EDL933
- Salmonella typhimurium
LT2
- Photorhabdus luminescens
(unpublished)
- Acinetobacter sp. str. ADP1 (unpublished)
For these bacterial genomes, genes have been clustered in at least 3 different
classes. By selecting one of these organisms the AMIGene program is going
to used the corresponding gene models for the automatic gene predictions
on your input sequence.
|
|
AMIGene parameters
The core of the AMIGene method uses
several parameter values which are explained in detail below, and summarize
in the following Table :
AMIGene METHOD : ATTRIBUTION OF CODING PROBABILITIES
Given the sequence of a complete
genome, we first look for the maximal CDS (i.e., maximal segments in-frame
between start and stop codons) and the positions in the six reading frames.
The putative CDSs longer than 60 bp are kept (M1, prokov-orf method). Then, the Prokov-curve method (M2) uses in parallel the specific models built
for the gene classes of the genome (see Gene Model section), in order to
compute the coding probabilities of each identified CDS. AMIGene indicates
the model (matrix 1, 2, 3 or 4) that fits best in terms of coding probability.
In order to speed up the following
step, a first selection of CDSs is made at this level, depending only on their coding probability. Two probability thresholds, prob-PC and
sure-PC, are defined (above Table). If the value of the coding probability
is
1) below the prob-PC threshold, the corresponding
CDS is discarded
2) between the prob-PC and sure-PC thresholds, the corresponding CDS is kept
in a list containing the probable CDSs (lst-prob-CDSs),
ONLY IF its length is longer than the Prob-LMin threshold.
3) above the sure-PC threshold, the corresponding CDS is kept in a list containing
the sure CDSs (lst-sure-CDSs).
AMIGene METHOD : CDSs SELECTION HEURISTIC
We defined an heuristic
to select CDSs in order to take into account tricky choices
between two overlapping CDSs, and/or presence of a frameshift error in the
DNA sequence. Our selection of the most likely CDSs consists in the
elimination of the false positives according to overlaping criteria between
adjacent CDSs, these overlaps being either total (they are called inclusion)
or partial.
In order to avoid overlaps due to the choice of the leftmost start codon,
an alternative start can be selected by AMIGene according to
the coding prediction curve climbing.
The AMIGene method is divided into three main steps :
A. step 1
From the list of sure CDSs (lst-sure-CDSs),
we first examine the inclusion cases only. Two inclusion thresholds
are defined according to the transcription orientation of the included CDSs
: the sure-ss-I threshold for two CDSs transcribed in the same strand,
and the sure-os-I threshold for two CDSs transcribed in the opposite
strand. An inclusion percentage, given by the ratio of the size of the smallest
CDS to the size of the longest CDS, is then computed. If this value is below
the threshold corresponding to the transcription orientation of the two CDSs,
the smallest CDS is eliminated. In the figure (A) most of the included CDSs
are eliminated, and only one included CDS is kept, this inclusion being characteristic
of the presence of a compensating frameshift in the sequence.
B. step 2
In a second step we add, to this new list of sure CDSs, the list of probable
CDSs (lst-prob-CDSs). Overlaping
and inclusion cases between a probable CDS and a sure CDS are then
examined, the sure CDSs being always conserved. If a probable CDS is
included in a sure CDS, the probable CDS is always eliminated. If a sure
CDS overlaps a probable CDS (partially or totally), the probable CDS is always conserved (these
situations are represented in the figure B and may point out cases of putative
frameshifts in the DNA sequence). In the case of
overlaping CDSs transcribed
in the opposite strand,
an overlap percentage, given by the ratio of the size of this overlap to
the size of the probable CDS, is computed. If this value is above the overlaping
threshold between a sure and a probable CDS, sure-prob-O, the probable
CDS is eliminated.
C. step 3
In the last step of AMIGene, we look for overlaping cases only between
CDSs of the new list of probable CDSs. As opposed to the previous steps,
it is important here to allow for all overlaps between a CDS with others
probable adjacent CDSs. Therefore, for each probable CDS of the list, a total
overlap score is first computed (it is the sum of the overlap percentages
of the probable CDS being examined with adjacent probable CDSs). Then, the
list of probable CDSs is re-examined and a CDS having a score value above
a total overlaping threshold, prob-glob-IO, is eliminated. An example
of 3 probable CDSs having an important overlap is given in the Figure C.
If the 3 CDSs have a score value above the prob-glob-IO threshold,
the CDS having the highest coding probability will be kept.
SETTING THE THRESHOLD PARAMETER VALUES :
The following table has been obtained
after an optimization process of the threshold values on three reference genomes
annotations. Cases of false negatives are penalized twice (k=2) and ten
times (k=10).
ECOLI
= Escherichia coli K12, BACSU = Bacillus subtilis 168 , MYCTU
= Mycobacterium tuberculosis H37Rv
"Use
specific parameters"
Depending on your selection (Low,
Medium or High GC percent), AMIGene is going to used the set of threshold values
we have obtained for the corresponding reference genome (k=10), that
is to say :
- B. subtilis (BACSU) for a LOW GC content
- E. coli (ECOLI) for a MEDIUM GC content
- M. tuberculosis (MYCTU) for a HIGH GC content
"Choose
your own parameters"
In the above Table we propose
some thresholds taking as an example the analysis an anonymous DNA sequence. The
required parameter values can be selected from a predefined list, and several
ckecks are done before the submission of your query to AMIGene, in order
to avoid some inconsistencies (such as a Prob-Pc > Sure_Pc !).
|
Options
The home page of the AMIGene
results includes the list of predicted CDSs in text format, and a graph reporting
the protein coding potentials (both the position of the CDSs and the coding
prediction curves in the 6 reading frames of the input sequence). The
graph is fully dynamic and allows the user to navigate along the genome
(or contig); the corresponding list of predicted CDSs is
updated accordingly.
->The URL which is sent to the user's e-mail address is available for
one week and allows him to consult AMIGene predictions via this graphical
interface.
In addition, AMIGene creates several output files that can be subsequently
downloaded :
1) one file containing the list of predicted CDSs
2) one fasta file containing the nucleic sequences of the predicted CDSs
3) one faste file containing the corresponding protein sequences : ONLY if the user selects the "Translate CDSs" item in the "Options" section.
4) one file containing the position of putative frameshift errors which have been found
by the ProFED method described above : ONLY if the user selects the "Add FrameShift Errors Detection (ProFED)" item in the "Options" section.
ProFED : Procaryotic Frameshift Errors Detection
The ProFED method allows for frameshifts prediction in raw DNA sequences
or complete bacterial genome
without looking for sequence similarity in databanks. It only uses frame-dependent
properties of the protein
coding regions, namely the stop and the start codon locations combined with
the predicted coding probabilities
in the six reading frames. Our method is thus only based on the intrinsic
properties of the coding sequences.
It combines the results of two analyses : the search for translational initiation/termination
sites and the prediction of coding regions.
The input genome sequence is first split into smaller fragments (of length
L=20 kb) and the following steps are iterated on each DNA fragment :
(i) A CDS searching method is executed (Prokov-orf; A. Viari, personnal communication),
and the positions of the putative CDSs in the six reading frames are kept
(M1).
(ii) this list is then parsed to create, for each reading frame k, a boolean
(0/1) vector Bk of length L, which contains 1 if the position is in a CDS
and 0 if the position is not in a CDS.
(iii) the Prokov-curve method (A. Viari, personnal communication) is
used to produce six numeric vectors Pk, corresponding to the coding probabilities along the DNA fragment for
each of the six frames (M2).
(iv) the previous results are then merged in the following way (M1+M2):
(iv-a) for each numeric vector Pk, a sliding window of width W is moved along
the sequence to calculate the mean of the GeneMark coding probability (mean-gm) at each position i in the
sequence.
(iv-b) if mean-gm exceeds a preset threshold called Tgm, the value Bk(i) of
the boolean vector Bk, at position i, is considered.
(iv-b-1) If Bk(i) is 0 (case A), this indicates a desagreement between the
coding prediction and the location maximal CDS and therefore a putative frameshift introducing
an incorrect in-frame stop codon.
(iv-b-2) if Bk(i) is 1, the program looks for an another frame k' containing,
at the same position i, a boolean vector B'k(i) equal to 1 and a numeric vector P'k with a mean-gm
value greater or equal to the threshold Tgm. This second case (case B) indicates partially
overlapping CDSs that should putatively be merged together.
(iv-c) if one of the conditions (iv-b-1) or (iv-b-2) holds, then position
i is suspected to contain a frameshift error and is reported in the output file. Consecutive erroneous positions
are further merged into a single putative sequencing error.
|