Next Generation Sequencing technologies generate a high volume of primary data that needs to be processed by several bioinformatics tools against public or private resources. Downstream data also need to be handled by other programs before new biological knowledge can be built. Existing workflow management systems (Taverna, Galaxy) mainly focus on design and execution, but not necessarily on automation and parallelization, both of which are in high demand. Here we present BIRDS (Bioinformatics Rules Driven System), a new Java API designed to automate the generation, execution and management of bioinformatics treatments in production mode, driven by rules applied to the availability of resources.
The main advantage of BIRDS is its capacity, through the Resource concept, to manage data on any support (file system, database system, etc.) without requiring any data formatting convention or metadata declaration. The bioinformatician declares a set of resources to be processed by a set of workflows. Each workflow, made of several modules, is orchestrated by BIRDS and is iteratively queried to check whether its resources are available; when they are, BIRDS launches the corresponding process using a set of dedicated business rules. These rules ensure the creation of the job with its options and also manage the job's life cycle.
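The availability-driven loop described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the BIRDS API: the names `Resource`, `Job`, `pollOnce`, the file names and the `filter_reads` command are all hypothetical, and a simple function stands in for the business rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ResourcePollingSketch {

    /** Hypothetical abstraction: any data source that can report whether it is ready. */
    interface Resource {
        String name();
        boolean isAvailable();
    }

    /** Hypothetical job description produced by a rule. */
    static final class Job {
        final String resourceName;
        final String commandLine;

        Job(String resourceName, String commandLine) {
            this.resourceName = resourceName;
            this.commandLine = commandLine;
        }
    }

    /** Simple in-memory resource used only for the demonstration. */
    static Resource resource(String name, boolean available) {
        return new Resource() {
            public String name() { return name; }
            public boolean isAvailable() { return available; }
        };
    }

    /** One polling pass: the rule is applied to every resource that is available. */
    static List<Job> pollOnce(List<Resource> resources, Function<Resource, Job> rule) {
        List<Job> jobs = new ArrayList<>();
        for (Resource r : resources) {
            if (r.isAvailable()) {
                jobs.add(rule.apply(r));
            }
        }
        return jobs;
    }

    /** Builds two resources, one ready and one not, and runs a single pass. */
    static List<Job> demo() {
        List<Resource> resources = new ArrayList<>();
        resources.add(resource("run42.fastq", true));
        resources.add(resource("run43.fastq", false));
        // Hypothetical rule: build a command line from the resource name.
        Function<Resource, Job> rule =
                r -> new Job(r.name(), "filter_reads --in " + r.name());
        return pollOnce(resources, rule);
    }

    public static void main(String[] args) {
        for (Job job : demo()) {
            System.out.println(job.resourceName + " -> " + job.commandLine);
        }
    }
}
```

Because the resource only has to answer whether it is available, the same loop could in principle sit on top of a file, a database row or any other support, which is the point of the Resource concept.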
BIRDS supports large-scale analysis through its multi-threaded and cluster execution implementation. A different job execution context can be defined for each treatment, from local execution to cluster execution using different job schedulers (LSF and SLURM are currently supported). Job building and execution information is stored in a dedicated database to allow job monitoring and exploration of treatment configurations. The database also maintains the full history of treatments, in particular the resources consumed, so that treatments already processed are not built again and so that the way the results were obtained can be certified.
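As a rough sketch of these two ideas, the Java example below pairs a per-treatment execution context with a history check that skips work already done. It rests on several assumptions: the class and method names are hypothetical, an in-memory set stands in for the dedicated database, and the scheduler invocations are reduced to their simplest form (`bsub` followed by the command for LSF, `sbatch --wrap=` for SLURM). It returns the command that would be submitted rather than running it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class ExecutionContextSketch {

    /** Where a job runs; each context wraps the command differently. */
    enum ExecutionContext {
        LOCAL, LSF, SLURM;

        List<String> wrap(String command) {
            switch (this) {
                case LSF:
                    return Arrays.asList("bsub", command);
                case SLURM:
                    return Arrays.asList("sbatch", "--wrap=" + command);
                default:
                    return Arrays.asList("sh", "-c", command);
            }
        }
    }

    /** In-memory stand-in for the history database (an assumption of this sketch). */
    private final Set<String> processed = new HashSet<>();

    /**
     * Returns the wrapped command for a (treatment, resource) pair,
     * or an empty Optional if that pair has already been processed.
     */
    Optional<List<String>> submit(String treatment, String resource,
                                  String command, ExecutionContext context) {
        String key = treatment + "|" + resource;
        if (!processed.add(key)) {
            return Optional.empty(); // already in the history: do not rebuild
        }
        return Optional.of(context.wrap(command));
    }

    public static void main(String[] args) {
        ExecutionContextSketch sketch = new ExecutionContextSketch();
        String cmd = "filter_reads --in run42.fastq"; // hypothetical command
        System.out.println(sketch.submit("qc", "run42.fastq", cmd, ExecutionContext.SLURM));
        // Second call with the same treatment and resource is skipped.
        System.out.println(sketch.submit("qc", "run42.fastq", cmd, ExecutionContext.SLURM));
    }
}
```

Keying the history on the treatment and the consumed resource together is one possible design; it lets the same resource be processed by a different treatment while still preventing duplicates.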
BIRDS integrates a web user interface to explore and manage jobs: users can monitor execution status, launch a job manually, and explore the details of a job (command line, input and output data, log files).
BIRDS is currently used in the SynBioWatch project, a collaboration between CEA and Direction Générale de l’Armement Maîtrise NRBC (DGA). SynBioWatch is an innovative NGS screening platform that automatically identifies Select Agent organisms in order to alert threat response teams. The platform is made of three main workflows: (1) a pre-processing step that filters the sequence reads and removes all uninformative sequences (such as duplicated or low-quality reads); (2) a comparison step in which all reads are compared to a comprehensive data set of sequences from public and specialized databases; (3) an automated rule-based classifier that raises alerts if needed (not yet developed). The end user can view the results in taxonomy reports displayed with Krona and in a text report for genome coverage analysis. All workflows are made of several biocomputing modules that are managed directly by BIRDS.
BIRDS is much more than an automation tool: it has been designed to be flexible and adaptable to any business domain by using the power of the Drools rules engine. Business rules are defined separately, and all automation functionality is delegated to BIRDS. These rules act at several stages: when building the command line, before and after job execution, and when an error occurs, for example by raising an alert or relaunching the job. BIRDS improves synchronization, robustness, control and traceability in the execution of millions of jobs on HPC clusters. The BIRDS project aims to be made available to the community as open source.
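The stages at which rules act (before execution, after execution, and on error with relaunch or alert) can be illustrated with a minimal Java sketch. It deliberately uses plain Java rather than Drools, and every name in it (`JobLifecycleSketch`, `runWithRetry`, `Outcome`) is hypothetical; the retry limit is an assumed policy, not one stated for BIRDS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class JobLifecycleSketch {

    /** Final state of a job once the lifecycle rules have run. */
    enum Outcome { SUCCEEDED, ALERTED }

    /**
     * Runs a job up to maxAttempts times. The attempt predicate receives the
     * attempt number (starting at 1) and returns true when the job succeeds.
     * Each stage appends a line to the log so the path taken can be traced.
     */
    static Outcome runWithRetry(IntPredicate attempt, int maxAttempts, List<String> log) {
        for (int i = 1; i <= maxAttempts; i++) {
            log.add("pre-execution: attempt " + i);
            if (attempt.test(i)) {
                log.add("post-execution: success");
                return Outcome.SUCCEEDED;
            }
            // Error stage: record the failure, then relaunch if attempts remain.
            log.add("error: attempt " + i + " failed");
        }
        log.add("alert: giving up after " + maxAttempts + " attempts");
        return Outcome.ALERTED;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Simulated job that fails once and succeeds on the second attempt.
        Outcome outcome = runWithRetry(n -> n >= 2, 3, log);
        log.forEach(System.out::println);
        System.out.println(outcome);
    }
}
```

In a rules engine these branches would be expressed as separate declarative rules instead of one method, so that the retry or alert policy can be changed without touching the execution code.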