Exploring the composition and diversity of the human gut microbiota remains a challenge for understanding its relationships with human health and disease. Two international projects (MetaHit and HMP) produced gene catalogues (about 8 and 5 million genes, respectively) that were merged into a set of almost 12 million non-redundant genes, of which 56% were incomplete. To complete these genes, we applied a targeted re-sequencing strategy using Roche-NimbleGen's sequence capture technology and implemented a bioinformatics pipeline dedicated to the analysis of these data.



A set of 16.8 million stringent probes was designed to cover the initial incomplete gene set. These probes were used to capture fecal microbial DNA from 50 individuals of the MetaHit cohort, and the resulting libraries of captured DNA were sequenced using the Illumina HiSeq paired-end protocol (2 × 100 bp).

An approach based on the global iterative assembly of paired-end reads co-localized at gene extremities has been specifically designed and implemented in the DIGEST (Directed Iterative Gene Extension by Sequencing capture Technology) pipeline. In short, the main steps are:
  1. selection of the reads that are informative for extension, the "overlapping reads" (i.e. pairs of reads for which one end matches one extremity of a gene to be extended), and discarding of the "internal reads" and "external reads" (BWA),
  2. assembly of these reads (RAYMETA),
  3. merging of resulting contigs with initial genes (BWA-MEM).
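The read selection in step 1 can be illustrated with a minimal Python sketch. It is not the DIGEST implementation: the `Alignment` type, the `margin` parameter and the exact definitions of "internal" (mapped reads that stay away from both gene extremities) and "external" (neither mate maps to a gene) are assumptions made for illustration, since only "overlapping reads" are defined in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Hypothetical record: position of one read on a gene (0-based, end-exclusive)."""
    start: int
    end: int
    gene_length: int


def touches_extremity(aln: Alignment, margin: int) -> bool:
    """True if the read lies within `margin` bases of either end of the gene."""
    return aln.start < margin or aln.end > aln.gene_length - margin


def classify_pair(mate1: Optional[Alignment],
                  mate2: Optional[Alignment],
                  margin: int = 10) -> str:
    """Classify a read pair; None means the mate did not map to any gene.

    Assumed definitions (not taken from the pipeline):
      external    -> neither mate maps to a gene
      overlapping -> at least one mate maps at a gene extremity
      internal    -> all mapped mates lie inside the gene body
    """
    mapped = [m for m in (mate1, mate2) if m is not None]
    if not mapped:
        return "external"
    if any(touches_extremity(m, margin) for m in mapped):
        return "overlapping"
    return "internal"
```

In this sketch, a pair whose first mate covers the last 50 bases of a 1,000 bp gene and whose second mate is unmapped is kept as "overlapping", because the unmapped mate may carry sequence lying beyond the current end of the gene.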
The extended contigs, together with the set of genes not yet extended, are used as the input dataset for the next iteration. The process stops when the set of "external reads" is empty or remains unchanged. Genes are then identified in the resulting contigs using ORF-finding software and clustered, which yields a new gene catalogue per individual. Finally, all gene catalogues are merged into a non-redundant gene catalogue.
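The iteration and its stopping rule can be sketched as a small driver loop. This is only an outline under stated assumptions: `select_reads`, `assemble` and `merge` are hypothetical callables standing in for the BWA, RAYMETA and BWA-MEM steps, `max_iterations` is an added safety bound, and the toy model at the end (genes as covered integer positions, reads as single positions) exists solely to make the example runnable.

```python
from typing import Callable, Set, Tuple


def run_iterations(genes: Set[int],
                   reads: Set[int],
                   select_reads: Callable[[Set[int], Set[int]], Tuple[Set[int], Set[int]]],
                   assemble: Callable[[Set[int]], Set[int]],
                   merge: Callable[[Set[int], Set[int]], Set[int]],
                   max_iterations: int = 20) -> Tuple[Set[int], int]:
    """Repeat select -> assemble -> merge until the external set is empty or unchanged."""
    previous_external = None
    iterations = 0
    while iterations < max_iterations:
        overlapping, external = select_reads(genes, reads)
        # Stopping rule from the text: external reads empty or unchanged.
        if not external or external == previous_external:
            break
        genes = merge(genes, assemble(overlapping))
        previous_external = external
        iterations += 1
    return genes, iterations


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_select(genes: Set[int], reads: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Overlapping: uncovered reads adjacent to a covered position; external: the other uncovered reads."""
    uncovered = reads - genes
    overlapping = {r for r in uncovered if r - 1 in genes or r + 1 in genes}
    return overlapping, uncovered - overlapping


def toy_assemble(overlapping: Set[int]) -> Set[int]:
    return set(overlapping)  # each read is its own "contig" in this toy model


def toy_merge(genes: Set[int], contigs: Set[int]) -> Set[int]:
    return genes | contigs


extended, n_iter = run_iterations({5}, {3, 4, 5, 6, 7, 20},
                                  toy_select, toy_assemble, toy_merge)
print(sorted(extended), n_iter)
```

In the toy run, the gene grows outward by one position per iteration; the read at position 20 never becomes adjacent to it, so the external set stops changing and the loop ends.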

DIGEST has been applied to one individual. The 200 million paired-end reads covered 2.8 million genes of the 12-million-gene catalogue, 48% of which were incomplete. Ten percent of the reads were "overlapping reads", and their assembly produced 636,813 contigs. After a single iteration, 16% of these 2.8 million genes (i.e. 437,490 genes) were extended to completion, and an additional set of 147,402 complete and 77,617 incomplete genes was generated. The procedure is being applied to the 49 other individuals to build a consolidated gene catalogue that could be used for accurate profiling and functional annotation in an integrated platform such as MicroScope. Specific completion and extension of the microbiome gene catalogue provides potentially important targets for diagnostic purposes.

By focusing sequencing on the parts of interest of any genomic or metagenomic dataset, this approach can be of great value for large-scale projects. Moreover, because the DNA capture design limits the number of occurrences of each probe, it eases access to the rare fraction of a complex sample. Beyond metagenomic studies, this methodology can be used for finishing any prokaryotic or eukaryotic genomic sequence and for re-sequencing genomes.

Explore the preliminary gene catalogue of the first individual processed by DIGEST: Contigs taxonomy assignation and Contigs functional assignation