The eukaryotic genome analysis team studies the structure and evolution of eukaryotic genomes using data from sequencing projects, in partnership with laboratories from the Institut Genomique or in collaboration with outside laboratories. This work is organized around three main activities: assembly, annotation and genomic analysis.
Starting from the collection of random reads produced by a whole genome shotgun (WGS) sequencing project, the assembly step aims to reconstruct the chromosome sequences of the organism under study. The algorithms rely on sequence identity between overlapping reads and on the topological information provided by “clone links” or by markers from genetic or physical maps. The result of the assembly, a set of “supercontigs”, is a consensus reconstruction of the original sequence.
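To make the overlap principle concrete, here is a minimal Python sketch of a greedy overlap-and-merge procedure. It is an illustration only, not the method used by the group or by Arachne: the function names are hypothetical, overlaps must be exact, and sequencing errors, reverse-complement reads, clone links and map markers are all ignored.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that exactly matches a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            # No remaining pair overlaps by at least min_overlap bases.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGT", "CGTACC", "ACCTTA"]
print(greedy_assemble(reads))
```

Real assemblers tolerate mismatches in overlaps and use the pairing information from clone links to order and orient contigs into supercontigs, which this sketch does not attempt.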
The tools and methods used by the group for this activity come either from software developed at the Institut Genomique or from research carried out by other groups working on assembly problems, such as the “Arachne” program developed at the Broad Institute (www.broad.mit.edu).
Annotation aims to define the structure of the genes in the assembled sequences, i.e. their start and end positions and their exons. We have chosen an approach that can integrate types of evidence that are not fixed in advance. In practice, this evidence falls into three broad categories:
1/ Ab initio predictions. For each genome we calibrate and run several gene prediction programs that rely on the statistical properties of protein-coding genes in the species. The programs are first calibrated on a collection of known genes.
2/ Exploitation of coding sequences. We align the full set of proteins available in public databases, as well as cDNA sequences available for related phyla. We give more statistical weight to cDNA collections from the same species, whether from public databases or from Genoscope. The alignments are carried out using programs that constrain exon junctions to sites compatible with splice boundaries.
3/ Comparative genomics. Depending on the phylum, we add results from genome-to-genome alignments that have been calibrated beforehand to retain coding regions preferentially. This principle relies on the fact that coding regions are better conserved than non-coding regions. For this purpose we use Exofish, a tool we originally developed to detect genes in the human sequence using the sequence of the pufferfish Tetraodon. This work led to the first re-estimation of the number of human genes.
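The splice-site constraint mentioned in category 2 can be illustrated with a small Python check. This is a sketch under stated assumptions, not the alignment programs themselves: the function name and coordinate convention (0-based, half-open, forward strand) are hypothetical, and only the canonical GT...AG intron boundaries are accepted.

```python
def introns_are_canonical(genome: str, exons: list[tuple[int, int]]) -> bool:
    """Return True if every intron between consecutive exons starts with GT and ends with AG.

    `exons` holds (start, end) pairs, 0-based and half-open, sorted along the forward strand.
    """
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = genome[left_end:right_start]
        if len(intron) < 4:
            return False
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


# Toy sequence: exon, intron (GT...AG), exon.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(introns_are_canonical(genome, [(0, 6), (18, 24)]))  # junctions on GT/AG
print(introns_are_canonical(genome, [(0, 5), (18, 24)]))  # junction shifted by one base
```

A splice-aware aligner applies this kind of constraint while searching for the alignment, rather than as a filter afterwards.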
These predictions are then “reconciled” so as to retain only one “gene model” per locus. This step is performed with the Gaze program, which integrates a set of weighted evidence and feeds it into an automaton that we adapt. Using dynamic programming, this step guarantees that, for each sequence, the resulting set of gene models contains no reading-frame break and has the maximum score.
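The idea of selecting a maximum-score, mutually compatible set of models by dynamic programming can be sketched as follows. This is a deliberately simplified stand-in (weighted interval scheduling over whole candidate models) and does not reproduce Gaze's feature model or its handling of reading frame; the `Candidate` type, its fields and the scores are hypothetical.

```python
from bisect import bisect_right
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int    # inclusive
    end: int      # exclusive
    score: float  # weight assigned to this piece of evidence
    source: str   # e.g. "abinitio", "cdna", "protein"


def best_non_overlapping(candidates: list[Candidate]) -> list[Candidate]:
    """Choose non-overlapping candidates with the maximum total score."""
    ordered = sorted(candidates, key=lambda c: c.end)
    ends = [c.end for c in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: optimum over the first i candidates
    take = [False] * len(ordered)
    prev = [0] * len(ordered)
    for i, cand in enumerate(ordered):
        # Number of earlier candidates that end at or before this one starts.
        prev[i] = bisect_right(ends, cand.start, 0, i)
        with_cand = best[prev[i]] + cand.score
        if with_cand > best[i]:
            best[i + 1] = with_cand
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


candidates = [
    Candidate(0, 100, 5.0, "abinitio"),
    Candidate(50, 150, 8.0, "cdna"),
    Candidate(120, 200, 4.0, "protein"),
    Candidate(160, 300, 6.0, "comparative"),
]
for model in best_non_overlapping(candidates):
    print(model)
```

Because each table entry depends only on earlier entries, the optimum is found in a single pass after sorting, which is the property that lets a dynamic programming approach guarantee a maximum-score result.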
The results of the various analyses are stored in a database and are accessible to collaborators through a dedicated interface based on the Generic Genome Browser (GGB).
We analyzed the genome sequence of the fish Tetraodon nigroviridis because of its very small size, 8 times smaller than the human genome. The level of gene conservation between these two species, 400 million years after their separation, allowed us to estimate the number of human genes in 2000. In 2004, the in silico reconstruction of the Tetraodon chromosomes provided evidence for a whole genome duplication event in this lineage. This duplication, called 3R by evolutionary biologists, had previously been proposed as one of the hypotheses that could explain the success of the teleost fish, a group with a large number of species adapted to many different ecosystems.
For each species we carry out a number of analyses relating to structural, functional and/or evolutionary characterization, in collaboration with other laboratories. We have developed expertise in the discovery of ancestral whole genome duplication (WGD) events and other polyploidizations. This type of evolutionary event is thought to be an essential agent in the acquisition of new functions and in the emergence of new species. Major evolutionary lineages such as the teleost fish or the angiosperm plants almost certainly derive from polyploidizations. For these studies, the genome sequences of the fish Tetraodon nigroviridis, the grapevine Vitis vinifera and the ciliate Paramecium tetraurelia are excellent models.
The sequence of the macronuclear genome of Paramecium spectacularly conserves the trace of at least 3 whole genome duplications during evolution (in the circular representation, the outer part of the circle represents the most recent events and the inner parts more ancient ones). Whereas very few duplicated genes usually remain after whole genome duplications (as in fish, plants and yeast), in this case 24,000 genes, representing 68% of the total, are maintained in 2 copies following the most recent duplication. Furthermore, very few chromosome rearrangements have occurred, as shown by the preserved gene order. These characteristics, essentially the large number of genes duplicated at three different evolutionary periods, show that gene loss is strongly constrained over the short term. In particular, the stoichiometric constraints on genes involved in interactions are very strong.
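As a quick check on the figures quoted above, the total gene count implied by "24,000 genes, representing 68% of the total" can be worked out directly. The helper name is hypothetical, and the calculation assumes that the 24,000 figure counts individual genes rather than pairs.

```python
def implied_gene_total(duplicated_genes: int, duplicated_fraction: float) -> float:
    """Total gene count implied by a duplicated-gene count and its share of the total."""
    return duplicated_genes / duplicated_fraction


total = implied_gene_total(24_000, 0.68)
print(round(total))           # implied total, roughly 35,300 genes
print(round(total) - 24_000)  # genes not retained in 2 copies under this assumption
```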