A set of 16.8 million stringent probes was designed to cover the initial
incomplete gene set. These probes were used to capture fecal microbial
DNA from 50 individuals of the MetaHIT cohort, and the resulting
libraries of captured DNA were sequenced using an Illumina HiSeq
paired-end protocol (2 × 100 bp).
A dedicated approach, based on global iterative assembly of paired-end
reads that co-localize at gene extremities, has been specifically
designed and implemented in the
DIGEST (Directed Iterative Gene Extension by Sequencing capture Technology)
pipeline. In short, the main steps are:
- selection of informative reads for extension, the "overlapping reads"
  (i.e. pairs of reads for which one end matches one extremity of a gene
  to be extended), and discarding of "internal reads" and "external
  reads" (BWA),
- assembly of these reads (RAYMETA),
- merging of the resulting contigs with the initial genes (BWA-MEM).
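
The read-selection step can be illustrated with a minimal sketch. It is
not the DIGEST implementation, which relies on BWA alignments. The text
defines only "overlapping reads", so the meanings used here for
"internal" (aligned, but away from both gene extremities) and "external"
(neither mate aligned) are assumptions, as are the hypothetical `Hit`
class, the `classify_pair` function and the `edge` window size.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Alignment of one mate on a gene (0-based, end-exclusive). Hypothetical type."""
    start: int
    end: int
    gene_length: int


def touches_extremity(hit: Hit, edge: int) -> bool:
    """True if the alignment reaches into the first or last `edge` bases of the gene."""
    return hit.start < edge or hit.end > hit.gene_length - edge


def classify_pair(mate1: Optional[Hit], mate2: Optional[Hit], edge: int = 100) -> str:
    """Classify a read pair; a mate given as None did not align to any gene."""
    hits = [h for h in (mate1, mate2) if h is not None]
    if not hits:
        # Assumption: "external" means no mate aligns to the gene set.
        return "external"
    if any(touches_extremity(h, edge) for h in hits):
        # From the text: one end matches one extremity of a gene.
        return "overlapping"
    # Assumption: "internal" means aligned, but not near either extremity.
    return "internal"
```

Under these assumptions, only pairs labelled "overlapping" would be
passed on to the assembly step.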
The extended contigs, together with the set of yet unextended genes, are
used as the input dataset for the next iteration. The process stops when
the set of "external reads" is empty or remains unchanged. From the
resulting contigs, genes are identified using ORF-finding software and
then clustered; this leads to a new gene catalogue per individual. Finally,
all gene catalogues are merged into a non-redundant gene catalogue.
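
The iteration and its stopping rule can be sketched as a fixed-point
loop. This is a schematic illustration under stated assumptions, not the
pipeline's code: `iterate_extension`, its callable parameters and the
`max_iter` safety bound are hypothetical, and genes and reads are reduced
to integers so that the example is self-contained. The stopping test
follows the criterion as stated in the text (the set of "external reads"
is empty or unchanged since the previous iteration).

```python
from typing import Callable, Optional, Set, Tuple

Reads = Set[int]
Genes = Set[int]


def iterate_extension(
    genes: Genes,
    reads: Reads,
    select: Callable[[Genes, Reads], Tuple[Reads, Reads]],
    assemble: Callable[[Reads], Genes],
    merge: Callable[[Genes, Genes], Genes],
    max_iter: int = 100,  # hypothetical safety bound, not from the text
) -> Genes:
    """Repeat select, assemble and merge until the external set is empty or stable."""
    previous_external: Optional[Reads] = None
    for _ in range(max_iter):
        overlapping, external = select(genes, reads)
        if not external or external == previous_external:
            break
        # Extended contigs plus unextended genes feed the next iteration.
        genes = merge(genes, assemble(overlapping))
        previous_external = external
    return genes


def toy_select(genes: Genes, reads: Reads) -> Tuple[Reads, Reads]:
    """Toy stand-in: a read 'overlaps' if it is adjacent to a current gene position."""
    candidates = reads - genes
    overlapping = {r for r in candidates if r - 1 in genes or r + 1 in genes}
    return overlapping, candidates - overlapping


# Toy run: assembly is the identity and merging is a set union.
extended = iterate_extension({5}, {3, 4, 6, 7, 20}, toy_select, set, lambda g, c: g | c)
```

In this toy run the gene `{5}` grows to `{3, 4, 5, 6, 7}` over two
iterations, and the loop stops on the third because the external set
`{20}` no longer changes.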