Prerequisites | Reflow Maven Skin

DIGEST has been developped for the CCRT architecture and approved for python 2.7.x.

module load python/2.7.3

module load python/2.7.8

DIGEST automatically load france genomique (fg) environment with its ray, samtools and bwa modules. CD-HIT (v4.5.8-2012-03-24) and MetaGene have already been compiled in digest-ccrt/bin and must be in PATH.

export PATH=$PATH:digest-ccrt/bin/

Furthermore, DIGEST needs the DIGEST_functions.py in PYTHONPATH to be run.

export PYTHONPATH=$PYTHONPATH:digest-ccrt/src/main/scripts/DIGEST_functions.py

DIGEST

For the main steps DIGEST here . To run DIGEST just launch DIGEST.py:

python src/main/scripts/DIGEST.py

This script launch jobs with ccc_msub command and manages dependencies.

usage: DIGEST.py [-h] [-source DIGEST_HOME] [-R REFERENCE] [-1 PAIR1]
                 [-2 PAIR2] [-o PREFIX] [-qual MINMAPQ] [-k KMERLENGTH]
                 [-n LIMLENGTH] [-p PREDICTOR] [-c CLUTSERTHRESHOLD]
                 [-aS MINALIG] [-A PROJID] [-q QUEUE] [-t PROCESSORS]
                 [-N NODES] [--loop] [-stoploop STOPLOOP]

Main script for DIGEST workflow

optional arguments:
  -h, --help           show this help message and exit
  -source DIGEST_HOME  digest-ccrt path
  -R REFERENCE         reference in FASTA format with its index files in the
                       same folder
  -1 PAIR1             1st FASTQ file from a pair
  -2 PAIR2             2nd FASTQ file from a pair
  -o PREFIX            output prefix (default:output)
  -qual MINMAPQ        MAPQ min to keep alignment (default:30)
  -k KMERLENGTH        kmer length for Ray (default:27)
  -n LIMLENGTH         min length for partial ORF (default=100)
  -p PREDICTOR         genes predictor tool to use : 'mga' (MetaGeneAnnotator)
                       or 'prodigal' (default:mga)
  -c CLUTSERTHRESHOLD  sequence identity threshold for clustering
                       (default:0.95)
  -aS MINALIG          alignment coverage for the shorter sequence
                       (default:0.0) if set to 0.9,the alignment must covers
                       90 pourcent of the sequence
  -A PROJID            CCRT project/account name (default:None)
  -q QUEUE             Job Priority (default:large)
  -t PROCESSORS        Numbers of threads requested to run each job
                       (default:16)
  -N NODES             maximum number of nodes to use together (default:1)
  --loop               DIGEST loop (default:False)
  -stoploop STOPLOOP   if loop, max number of loops (default:None)

The reference (-R) argument corresponds to partial gene catalogue to extend. This reference must previously be index by BWA. Index files must be in the same folder and have the same prefix as reference fasta file. Example:

 ls DATA/
Ref.fasta Ref.fasta.amb Ref.fasta.ann Ref.fasta.bwt Ref.fasta.pac Ref.fasta.sa

DIGEST embeds MetaGeneAnnotator to predict genes but DIGEST can also use Prodigal with the '-p prodigal' optional argument. Please note, unlike MetaGeneAnnotator, Prodigal is not included in the DIGEST workflow and must be in $PATH.

By default, DIGEST runs jobs on 1 node and uses 16 cores (queue = large : 1 node = 16 cores). For big data, we recommend to specified more than 1 node to accelerate the clustering step. Otherwise, jobs can be stopped by the CCRT due to the time limit. In return, more jobs are submitted.

Output

DIGEST produce a PREFIX_DIGEST folder with 3 subfolder : Process, Reads and Result.

Process

Assembly : Ray Meta output files and the contigs bwa index.
MappingReadsOnTargets : PREFIX_sorted.bam - the mapping of reads against the reference data set sorted by reads name, in bam format.
MappingTargetsOnContigs : PREFIX_BWAmem.bam - the mapping of initial data set against Ray contigs, in bam format.
ORFdetection : PREFIX_Extended.fasta - extended contigs ; PREFIX_metagene.txt - metagene ORF prediction of extended contigs.
jobProcess : bash scripts submitted with their error and output files.

Reads

PREFIX_overlap_p1.fasta and PREFIX_overlap_p2.fasta - pairs of reads for which one end matches one extremity of a gene to be extended.
PREFIX_unmap_p1.fasta and PREFIX_unmap_p2.fasta - pairs of reads which don't match on initial data set.

Result

PREFIX_complete_RefCluster.fasta - genes completed and clustered.
PREFIX_incomplete_RefCluster.fasta - genes incompleted and clustered.
PREFIX_unmappedTarget.fasta - inital partial genes unmapped on Ray contigs.

errorProcess.txt (optional) - generated if DIGEST encounters and error and write it.

DIGEST loop

With de --loop argument, DIGEST can restart automatically at the end of a round. After the first iteration, PREFIX_incomplete_RefCluster.fasta and PREFIX_unmappedTarget.fasta are merged and indexed to form a new reference data set. Only unmapped and overlapped reads are reused.

A folder is generated for each iteration and named LOOPX_PREFIX_DIGEST (with X = iteration number). This folder combining all DIGEST output for this loop. For each iteration, une new line is written in the PREFIX_DIGEST-loop.txt file like this :

loopX : Y complete - Z partial

Which X = iteration number, Y = number of completed and clustered genes, and Z = number of incompleted and clustered genes.

DIGEST stop when there are no more reads, completed or incompleted genes, or if a errorProcess.txt file are created. DIGEST can also stop if the number of completed genes remain constant. Finaly, DIGEST merges and clusters all completed genes of each iteration in total_completeORF.fasta. All folders and files are combining in the PREFIX_DIGEST folder.

	France Genomique