The genome of the fish Tetraodon nigroviridis was sequenced using the whole-genome shotgun strategy: two centers, Genoscope and the Broad Institute of MIT and Harvard (formerly the Whitehead Institute Center for Genome Research, WICGR), produced 4.53 million reads from both ends of DNA fragments of various sizes sampled from the whole genome. After quality filtering, 4.25 million reads were retained for the assembly (2.851 Gb, corresponding to a coverage of 7.9 X for a genome size estimated at 350 Mb). More than 42% of these reads (38% of the bases read) were produced by the WICGR from small inserts cloned in plasmids, and about 54% (61% of the bases read) were produced at Genoscope using the same type of clones. In addition, Genoscope produced about 50,000 reads from the ends of large inserts cloned in BACs, to provide long-range clone links for the assembly. The table below gives, for each library, the size of the inserts whose ends were sequenced, the number of reads and the corresponding volume of sequence (the useful reads from the WICGR are shorter than those from Genoscope).
|Library||Center||Insert size (kb)||Useful reads (millions)||Bases (Gb)||Coverage (X)|
|plasmid||Genoscope||2 - 5||1.466||1.125||3.2|
|plasmid||WICGR||2 - 8||1.794||1.092||3.1|
|plasmid||Genoscope||1.5 - 3||0.827||0.603||1.7|
|BAC||Genoscope||100 - 160||0.027||0.018||0.05|
|BAC||Genoscope||120 - 180||0.020||0.013||0.04|
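The coverage figures in the last column of the table can be checked with a little arithmetic: fold coverage is the volume of sequence divided by the genome size. The sketch below assumes that the bases column is in gigabases and uses the 350 Mb genome size estimate quoted in the text; the variable and function names are illustrative only.

```python
# Depth of coverage per library = sequenced bases / genome size.
GENOME_SIZE_MB = 350.0  # genome size estimate quoted in the text

# (library, center, bases in Gb), copied from the table above
LIBRARIES = [
    ("plasmid 2-5 kb", "Genoscope", 1.125),
    ("plasmid 2-8 kb", "WICGR", 1.092),
    ("plasmid 1.5-3 kb", "Genoscope", 0.603),
    ("BAC 100-160 kb", "Genoscope", 0.018),
    ("BAC 120-180 kb", "Genoscope", 0.013),
]


def coverage(bases_gb, genome_mb=GENOME_SIZE_MB):
    """Return the fold coverage for a given volume of sequence."""
    return bases_gb * 1000.0 / genome_mb


per_library = [(name, center, coverage(gb)) for name, center, gb in LIBRARIES]
total_gb = sum(gb for _, _, gb in LIBRARIES)
total_coverage = coverage(total_gb)

for name, center, cov in per_library:
    print(f"{name:18s} {center:10s} {cov:5.2f} X")
print(f"total: {total_gb:.3f} Gb, {total_coverage:.1f} X")
```

The per-library values agree with the table once rounded. The simple total comes out near 8.1 X, slightly above the 7.9 X quoted in the text, so that figure was presumably computed on a slightly different basis.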
During assembly, the level of polymorphism among the reads had to be taken into account. The genetic material came from three different animals: one was used for the plasmid library produced at the WICGR, another for the two plasmid libraries produced by Genoscope, and a third for the BAC library at Genoscope. Moreover, these fish, which came from the tropical fish trade, were not from an inbred line. Because of the large number of polymorphisms, a preliminary assembly of the 4 million reads derived from these three animals resulted in an excessive number of redundant contigs. To limit confusion between polymorphisms and sequencing errors or divergence between duplicated regions, a sequential assembly strategy was applied. First, the WICGR reads (3.1 X), which were derived from a single individual, were assembled using the Arachne program. The reads from the Genoscope libraries were then incorporated into the assembly iteratively. In parallel, Genoscope performed an assembly with reads from the two centers and compared it with the Arachne assembly using the BLAST program. Contigs with no alignment to the Arachne assembly (about 10% of the Genoscope assembly) were added to it.
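The final merging step, keeping only the contigs that have no alignment to the other assembly, amounts to a set difference. The sketch below is a minimal illustration, not the pipeline actually used: it assumes a tab-separated BLAST report whose first column is the query identifier, and the contig names and the function `contigs_without_hits` are hypothetical. The source does not state which alignment thresholds were applied, so none are modelled here.

```python
import io


def contigs_without_hits(all_contig_ids, blast_tabular):
    """Return the contig IDs that never appear as a query in a tabular BLAST report."""
    with_hits = set()
    for line in blast_tabular:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        with_hits.add(line.split("\t")[0])  # column 1 = query ID
    return [cid for cid in all_contig_ids if cid not in with_hits]


# Toy data: three contigs from one assembly, two of which align to the other.
genoscope_contigs = ["g1", "g2", "g3"]
report = io.StringIO(
    "g1\tarachne_7\t99.10\t1200\t10\t1\t1\t1200\t501\t1700\t0.0\t2100\n"
    "g3\tarachne_2\t98.50\t800\t12\t0\t1\t800\t1\t800\t0.0\t1400\n"
)

to_add = contigs_without_hits(genoscope_contigs, report)
print(to_add)
```

In this toy case only `g2` has no alignment, so it is the only contig that would be added to the reference assembly.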
At the end of this stage the combined assembly contained 49,609 contigs, representing 312 Mb of sequence. Arachne ordered these contigs using clone links to form 25,773 supercontigs, or scaffolds. These span 342 Mb, which means that the scaffolds contain 30 Mb of gaps, each spanned by one or more clones. The assembly shows good long-range continuity: 50% of the bases lie in scaffolds longer than 731 kb (the N50 length) and 80% of the bases lie in 805 scaffolds longer than 41 kb (the N80 length). The largest scaffold is 7.6 Mb, which is on the order of the length of one chromosome arm in Tetraodon.
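The N50 and N80 lengths quoted above follow the usual definition: sort the scaffolds from longest to shortest and accumulate their lengths until the running total reaches 50% (or 80%) of all bases; the length of the scaffold reached at that point is the statistic. A minimal sketch on toy lengths (the function name `nx_length` and the data are illustrative, not taken from the assembly):

```python
def nx_length(lengths, percent=50):
    """Return the Nx length: the largest L such that sequences of length >= L
    contain at least `percent` % of the total bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]  # arbitrary units, 32 in total
n50 = nx_length(toy_scaffolds, 50)
n80 = nx_length(toy_scaffolds, 80)
print(n50, n80)
```

With these toy values the two longest sequences already hold half of the bases, so the N50 is 8, while five sequences are needed to reach 80%, giving an N80 of 3.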
In parallel to the whole-genome shotgun sequencing, a physical mapping project was undertaken using the BAC clones, in order to validate the scaffolds and place them in their correct order. At the beginning of the project, the whole-genome shotgun strategy had never been used to produce sequences from large genomes, and it seemed reasonable to support the assembly with a physical map of the Tetraodon genome (genetic mapping is not possible because Tetraodon cannot be bred in captivity). Three mapping strategies were followed:
In addition to the physical mapping data, two other types of information made it possible to link the scaffolds: screening of the scaffolds with pairs of BAC or plasmid end sequences not used in the assembly, and alignment of Tetraodon scaffold sequences to the genomic assembly of Takifugu rubripes. Combining all of these data made it possible to group scaffolds into “ultracontigs” on the 21 chromosomes of Tetraodon. In total, 1,702 scaffolds were assembled into 128 ultracontigs, which represent 80.5% of the assembly. Of these ultracontigs, 39 (64.2% of the assembly) could be anchored on the chromosomes. The contiguity thus obtained is about 50 times better than that of the draft genomic sequence of fugu. The statistics for the assembly are given in the table below:
|Sequence set||Number||N50 length (kb)||Size, gaps included (Mb)||Size, gaps excluded (Mb)||Size of the longest (kb)||Percentage of the genome, gaps included|
|Mapped ultracontigs||39||7,601||218.3||197.7||11,977||64.2|
|All ultracontigs||128||1,382||274.0||247.0||11,977||80.5|
|Mapped scaffolds||1,338||1,382||218.2||197.7||7,612||64.2|
|All scaffolds||25,773||731||342.4||312.4||7,612||100.7|
|Mapped contigs||16,083||26||197.7||197.7||258||58.1|
|All contigs||49,609||16||312.4||312.4||258||91.9|
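The gap content of each row follows directly from the two size columns: gaps are the difference between the size with gaps and the size without. The sketch below recomputes this from the table values, reading the decimal commas of the original figures as decimal points; the names `ROWS` and `gap_summary` are illustrative.

```python
# (sequence set, size with gaps in Mb, size without gaps in Mb), from the table above
ROWS = [
    ("Mapped ultracontigs", 218.3, 197.7),
    ("All ultracontigs", 274.0, 247.0),
    ("Mapped scaffolds", 218.2, 197.7),
    ("All scaffolds", 342.4, 312.4),
    ("Mapped contigs", 197.7, 197.7),
    ("All contigs", 312.4, 312.4),
]


def gap_summary(with_gaps_mb, without_gaps_mb):
    """Return (gap size in Mb, gaps as a percentage of the gapped size)."""
    gap_mb = round(with_gaps_mb - without_gaps_mb, 1)
    gap_pct = round(100.0 * gap_mb / with_gaps_mb, 1)
    return gap_mb, gap_pct


summary = {name: gap_summary(a, b) for name, a, b in ROWS}
for name, (gap_mb, gap_pct) in summary.items():
    print(f"{name:20s} {gap_mb:5.1f} Mb of gaps ({gap_pct:.1f}%)")
```

For the full scaffold set this gives the 30 Mb of gaps mentioned in the text, a little under 9% of the scaffold span; contigs contain no gaps by construction.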
This large-scale assembly was evaluated by fluorescence in situ hybridization (FISH) of pairs of BAC clones chosen near the ends of the 44 largest scaffolds (those with the highest risk of error). In all cases, the two BAC clones hybridized to the same chromosome, thus validating the scaffolds (in one case, however, the BACs hybridized on either side of the centromere). Furthermore, the proportion of the euchromatic regions of the Tetraodon genome included in the assembly was evaluated by aligning 1,472 new random reads. An alignment was obtained for 90% of these reads; despite masking of repeat sequences, some of the 1,472 reads could correspond to heterochromatic regions. This suggests that the assembly probably contains more than 90% of the euchromatin.
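To get a feel for the sampling uncertainty attached to an estimate based on 1,472 reads, one can compute an approximate confidence interval for the aligned fraction. This is an illustration added here, not part of the original analysis: it assumes the rounded value of 90% (the exact count of aligned reads is not given) and uses the simple normal approximation to the binomial distribution.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Assumed inputs: 90% of 1,472 random reads aligned to the assembly.
low, high = proportion_interval(0.90, 1472)
print(f"aligned fraction: 90.0% (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions the interval is roughly 88.5% to 91.5%. It reflects sampling error only and says nothing about reads that derive from heterochromatin.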
The annotation was carried out at Genoscope by combining several resources: alignment of protein sequences from three other sequenced vertebrates (Takifugu, mouse, human) on the draft, followed by alignment of the genomic sequences themselves using Exofish; alignment of the end sequences of 155,000 Tetraodon cDNA clones prepared from 7 different tissues from the fish; and finally, ab initio prediction of genes with the Genscan and GeneID programs. All of these annotation resources were combined with the GAZE program (Howe et al., 2002), which produced 34,355 gene models. After elimination of the most obvious artifacts, 27,918 gene models were retained.
A special effort was made for gene families that posed annotation problems: the selenoproteins on the one hand, and the type I helical cytokines and their receptors on the other. Selenoproteins incorporate the amino acid selenocysteine, which is encoded by the TGA triplet. The problem, therefore, lies in distinguishing these coding triplets from TGA triplets acting as stop codons. Various methods were used to identify the selenoproteins of Tetraodon (see article). They led to the definition of 18 to 19 families. All the families known in eukaryotes were found, except for one, and a new putative gene, identified by searching for SECIS elements and in-frame TGA codons, corresponds to a family of selenoproteins specific to fish, with no equivalent in other vertebrates.
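The notion of an in-frame TGA codon can be made concrete with a small scan of a coding sequence. The sketch below only illustrates the reading-frame logic on toy sequences; it is not the method used for the Tetraodon annotation, which also relied on the detection of SECIS elements, and the function name `in_frame_tga` is hypothetical.

```python
def in_frame_tga(cds, frame=0):
    """Return the 0-based start positions of TGA codons in the given reading frame."""
    cds = cds.upper()
    return [i for i in range(frame, len(cds) - 2, 3) if cds[i:i + 3] == "TGA"]


# Toy sequences (not real genes).
# Codons in frame 0: ATG TGA AAA TGA -> an internal TGA and a terminal one.
candidate = in_frame_tga("ATGTGAAAATGA")
# Codons in frame 0: ATG ATG AAA -> the TGA triplets here are out of frame.
out_of_frame = in_frame_tga("ATGATGAAA")
print(candidate, out_of_frame)
```

An internal in-frame TGA such as the one at position 3 of the first toy sequence is the kind of site that has to be classified as either a selenocysteine codon or a genuine stop; the scan alone cannot decide between the two.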
The annotation strategy for type I cytokines and their receptors is based on the specificity of the intron-exon structure, and had been previously validated for the identification of class II cytokines. The class I genes which were identified were confirmed by cloning of their transcripts; they are described on the main page (The gene repertoire of Tetraodon).