Tetraodon nigroviridis

Whole genome shotgun

The genome of the fish Tetraodon nigroviridis was sequenced using the whole-genome shotgun strategy: two centers, Genoscope and the Broad Institute of MIT and Harvard (formerly the Whitehead Institute Center for Genome Research, WICGR), produced 4.53 million reads from both ends of DNA fragments of various sizes drawn from the whole genome. After quality filtering, 4.25 million reads were retained for the assembly (2.851 Gb, corresponding to a coverage of 7.9 X for a genome size estimated at 350 Mb). More than 42% of these reads (38% of the bases read) were produced by the WICGR from small inserts cloned in plasmids, and about 54% (61% of the bases read) were produced at Genoscope using the same type of clones. In addition, Genoscope produced about 50,000 reads from the ends of large inserts cloned in BACs, to provide long-range clone links for the assembly. The table below gives, for each library, the size of the inserts whose ends were sequenced, the number of useful reads and the corresponding volume of sequence (the useful reads from the WICGR are shorter than those from Genoscope).

Library   Center      Insert size (kb)   Useful reads (millions)   Useful bases (Gb)   Coverage (X)
plasmid   Genoscope   2 - 5              1.466                     1.125               3.2
plasmid   WICGR       2 - 8              1.794                     1.092               3.1
plasmid   Genoscope   1.5 - 3            0.827                     0.603               1.7
BAC       Genoscope   100 - 160          0.027                     0.018               0.05
BAC       Genoscope   120 - 180          0.020                     0.013               0.04
Total                                    4.254                     2.851               8.1
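
As an illustration of how the coverage column relates to the volume of sequence, the short Python sketch below divides the useful bases of each library by the genome size. It assumes the 350 Mb estimate quoted in the text; the variable and function names are chosen here for illustration only.

```python
# Assumption: haploid genome size of 350 Mb (0.350 Gb), as estimated in the text.
GENOME_SIZE_GB = 0.350

# (library, center, useful bases in Gb), copied from the table above.
LIBRARIES = [
    ("plasmid 2-5 kb", "Genoscope", 1.125),
    ("plasmid 2-8 kb", "WICGR", 1.092),
    ("plasmid 1.5-3 kb", "Genoscope", 0.603),
    ("BAC 100-160 kb", "Genoscope", 0.018),
    ("BAC 120-180 kb", "Genoscope", 0.013),
]


def coverage(bases_gb, genome_gb=GENOME_SIZE_GB):
    """Return the sequencing depth: sequenced bases divided by genome size."""
    return bases_gb / genome_gb


for name, center, bases in LIBRARIES:
    print(f"{name:18s} {center:10s} {coverage(bases):5.2f} X")

total_bases = sum(bases for _, _, bases in LIBRARIES)
print(f"total: {total_bases:.3f} Gb, about {coverage(total_bases):.1f} X")
```

With a 350 Mb genome, the per-library values round to the figures in the coverage column, and the total comes out close to the 8.1 X of the last row.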


During the assembly stage it was necessary to take into account the level of polymorphism among the reads. The genetic material came from three different animals: one was used for the plasmid library produced at the WICGR, another for the two plasmid libraries produced by Genoscope and a third for the BAC library at Genoscope. Moreover, these fish, which came from the tropical fish trade, were not from an inbred line. Because of the large number of polymorphisms, a preliminary assembly of the 4 million reads derived from these three animals produced an excessive number of redundant contigs. To limit the confusion between polymorphisms on the one hand and sequencing errors or divergence between duplicated regions on the other, a sequential assembly strategy was applied. First, the WICGR reads (3.1 X), which were derived from a single individual, were assembled using the Arachne program. The reads from the Genoscope libraries were then incorporated into the assembly iteratively. In parallel, Genoscope produced its own assembly with the reads from the two centers and compared it with the Arachne assembly using the BLAST program. Contigs that showed no alignment (about 10% of the Genoscope assembly) were added to the Arachne assembly.
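
The last step, keeping the contigs of one assembly that find no match in the other, can be sketched as follows. This is only an illustrative sketch, not the procedure actually used: it assumes the BLAST results are available as tab-separated lines whose first column is the query identifier (as in BLAST's tabular output), and all identifiers and hit lines in the example are invented.

```python
def contigs_without_hits(contig_ids, blast_tabular_lines):
    """Return the contigs that never appear as a query in the BLAST hits.

    blast_tabular_lines: tab-separated lines whose first column is the
    query sequence identifier (assumed tabular BLAST output).
    """
    queries_with_hits = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        queries_with_hits.add(line.split("\t")[0])
    # Preserve the input order of the contigs.
    return [cid for cid in contig_ids if cid not in queries_with_hits]


# Toy data: identifiers and hit lines are invented for illustration.
genoscope_contigs = ["ctgA", "ctgB", "ctgC"]
hits = [
    "ctgA\tarachne_12\t99.1\t850\t7\t0\t1\t850\t300\t1149\t0.0\t1500",
    "ctgC\tarachne_40\t98.4\t620\t10\t0\t5\t624\t1\t620\t0.0\t1090",
]
print(contigs_without_hits(genoscope_contigs, hits))
```

In practice one would also filter hits on identity and aligned length before deciding that a contig is already represented; those thresholds are not given in the text and are left out here.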

At the end of this stage the combined assembly contained 49,609 contigs, representing 312 Mb of sequence. Arachne ordered these contigs using clone links to form 25,773 supercontigs, or scaffolds. These cover 342 Mb, which means that the scaffolds contain 30 Mb of gaps, each spanned by one or more clones. The assembly exhibits good long-range continuity: 50% of the bases are included in scaffolds of more than 731 kb (N50 length) and 80% of the bases are in 805 scaffolds of more than 41 kb (N80 length). The largest scaffold is 7.6 Mb, which is on the order of the length of one chromosome arm in Tetraodon.
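
For readers unfamiliar with the N50 and N80 statistics, the following minimal Python sketch shows one common way to compute them from a list of scaffold lengths. The function name and the toy lengths are illustrative assumptions; the values quoted above were reported for the real assembly and are not recomputed here.

```python
def nx_length(lengths, fraction=0.5):
    """Return the Nx length: the length L of the shortest sequence in the
    smallest set of longest sequences that together contain at least
    `fraction` of the total bases (fraction=0.5 gives the N50 length)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy scaffold lengths in kb (invented for illustration).
toy_scaffolds = [90, 70, 50, 30, 10]
print("N50:", nx_length(toy_scaffolds, 0.5))
print("N80:", nx_length(toy_scaffolds, 0.8))
```

Here the two longest toy scaffolds (90 + 70 = 160 kb) already hold more than half of the 250 kb total, so the N50 length is 70 kb; reaching 80% requires the third scaffold, so the N80 length is 50 kb.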

In parallel with the whole-genome shotgun sequencing, a physical mapping project was undertaken using the BAC clones, in order to validate the scaffolds and place them in their correct order. At the beginning of the project, the whole-genome shotgun strategy had never been used to produce sequences from large genomes, and it seemed reasonable to build the assembly with the aid of a physical map of the Tetraodon genome (genetic mapping is not possible because Tetraodon cannot be bred in captivity). Three mapping strategies were followed:

  • Hybridization of 3,000 probes derived from BAC end sequences on high-density filters containing 55,000 BAC clones. A total of 903 BAC contigs could be defined in this way.
  • Fingerprinting of 32,991 clones and comparison of the restriction profiles: the identification of overlapping clones made it possible to define 2,658 BAC contigs.
  • Hybridization of pairs of BAC clones on the Tetraodon chromosomes (two-color FISH). A total of 117 BAC clones were used in 392 different combinations.

In addition to the physical mapping data, two other types of information made it possible to link the scaffolds: the screening of scaffolds with pairs of BAC or plasmid end sequences not used in the assembly; and the alignment of Tetraodon scaffold sequences to the genomic assembly of Takifugu rubripes. Combining all of these data made it possible to group scaffolds into “ultracontigs” on the 21 chromosomes of Tetraodon. In total, 1,702 scaffolds were assembled into 128 ultracontigs, which represent 80.5% of the assembly. Of these ultracontigs, 39 (64.2% of the assembly) could be anchored on the chromosomes. The contiguity thus obtained is about 50 times better than that of the draft genomic sequence of fugu. The statistics for the assembly are given in the table below:

Level                 Number   N50 length (kb)   Size, gaps included (Mb)   Size, gaps excluded (Mb)   Size of the longest (kb)   Percentage of the genome, gaps included
Mapped ultracontigs   39       7,601             218.3                      197.7                      11,977                     64.2
All ultracontigs      128      1,382             274.0                      247.0                      11,977                     80.5
Mapped scaffolds      1,338    1,382             218.2                      197.7                      7,612                      64.2
All scaffolds         25,773   731               342.4                      312.4                      7,612                      100.7
Mapped contigs        16,083   26                197.7                      197.7                      258                        58.1
All contigs           49,609   16                312.4                      312.4                      258                        91.9
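
The amount of gap sequence at each level follows from the two size columns of the table. The small Python sketch below makes that arithmetic explicit; the names are illustrative, and the figures are simply copied from the table.

```python
# (level, size with gaps in Mb, size without gaps in Mb), from the table above.
ASSEMBLY_LEVELS = [
    ("Mapped ultracontigs", 218.3, 197.7),
    ("All ultracontigs", 274.0, 247.0),
    ("Mapped scaffolds", 218.2, 197.7),
    ("All scaffolds", 342.4, 312.4),
]


def gap_summary(with_gaps_mb, without_gaps_mb):
    """Return (gap size in Mb, gaps as a percentage of the gapped size)."""
    gap_mb = round(with_gaps_mb - without_gaps_mb, 1)
    gap_percent = round(100 * gap_mb / with_gaps_mb, 1)
    return gap_mb, gap_percent


for level, with_gaps, without_gaps in ASSEMBLY_LEVELS:
    gap_mb, gap_percent = gap_summary(with_gaps, without_gaps)
    print(f"{level:20s} gaps: {gap_mb:5.1f} Mb ({gap_percent:.1f}% of the span)")
```

For the full set of scaffolds this gives 30.0 Mb of gaps, the figure quoted earlier in the text.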

This large-scale assembly was evaluated by FISH with pairs of BAC clones chosen near the ends of the 44 largest scaffolds (those with the highest risk of error). In all cases, the two BAC clones hybridized to the same chromosome, thus validating the scaffolds (in one case, however, the BACs hybridized on either side of the centromere). Furthermore, the proportion of the euchromatic regions of the Tetraodon genome included in the assembly was evaluated by aligning 1,472 new random reads. An alignment was obtained for 90% of these reads; since some of the reads, despite masking of repeat sequences, could correspond to heterochromatic regions, the assembly probably contains more than 90% of the euchromatin.


The annotation was carried out at Genoscope by combining several resources: alignment of protein sequences from three other sequenced vertebrates (Takifugu, mouse, human) to the draft assembly, followed by alignment of the genomic sequences themselves using Exofish; alignment of the end sequences of 155,000 Tetraodon cDNA clones prepared from seven different tissues of the fish; and finally, ab initio gene prediction with the Genscan and GeneID programs. All of these annotation resources were combined with the GAZE program (Howe et al., 2002), which produced 34,355 gene models. After elimination of the most obvious artifacts, 27,918 gene models were retained.

A special effort was made for gene families that posed annotation problems: the selenoproteins on the one hand, and the type I helical cytokines and their receptors on the other. Selenoproteins incorporate the amino acid selenocysteine, which is encoded by the TGA triplet. The problem, therefore, lies in distinguishing these coding triplets from TGA stop codons. Various methods were used to identify the selenoproteins of Tetraodon (see article). They led to the definition of 18 to 19 families. All the families known in eukaryotes were found, except for one, and a new putative gene, identified by searching for SECIS elements and in-frame TGA codons, corresponds to a family of selenoproteins specific to fish, with no equivalent in other vertebrates.
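
To make the notion of an in-frame TGA codon concrete, here is a minimal Python sketch that lists the TGA triplets lying in the reading frame of a candidate coding sequence, before its final codon. It only flags candidates: deciding whether such a codon encodes selenocysteine also requires evidence such as a SECIS element, which is not modeled here. The function name and the toy sequence are invented for illustration.

```python
def in_frame_tga_positions(cds):
    """Return 0-based nucleotide offsets of in-frame TGA codons located
    before the last codon of a coding sequence read in frame 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is excluded: a TGA there is the ordinary stop codon.
    return [3 * idx for idx, codon in enumerate(codons[:-1]) if codon == "TGA"]


# Toy sequence: ATG CTG ACC TGA GGT TAA
# It contains an out-of-frame TGA (offset 4) and an in-frame TGA (offset 9).
toy_cds = "ATGCTGACCTGAGGTTAA"
print(in_frame_tga_positions(toy_cds))
```

Only the TGA at offset 9 is reported, because the one at offset 4 straddles two codons of the reading frame.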

The annotation strategy for type I cytokines and their receptors is based on the specificity of the intron-exon structure, and had been previously validated for the identification of class II cytokines. The class I genes which were identified were confirmed by cloning of their transcripts; they are described on the main page (The gene repertoire of Tetraodon).


  • K.L. Howe, T. Chothia & R. Durbin (2002), GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Research 12, 1418-1427.
Last update on 11 January 2008

© Genoscope - Centre National de Séquençage
2 rue Gaston Crémieux CP5706 91057 Evry cedex
