BIRDS aims to automate the execution of bioinformatics treatments. Because it can manage a large number of bioinformatics treatments in parallel and keeps detailed records of each one, it is particularly suited to a production environment. BIRDS is a flexible API built on the Drools rules engine developed by JBoss, which makes it an expert system based on business rules. BIRDS can be extended by adding new rules written by users. Because the business logic is separated from the code and expressed as business rules that all users can understand, long-term maintenance is easier. The BIRDS API essentially contains the logic for creating and configuring treatments and for executing jobs.
In BIRDS, it is the availability of new resources that triggers the execution of bioinformatics programs. The BIRDS design is centered on treatment configuration (XML files). Once treatments are configured (i.e. connected to resource referentials) and declared by the administrator, the BIRDS process generates new jobs whenever the resources required by each treatment input are available. These jobs are submitted to a job scheduler or executed locally, depending on the execution context defined in the configuration files. New resources generated by job execution are stored in the resource referentials attached to the outputs, as specified in the configuration. These resources are recorded in the database as consumed resources, both to avoid rebuilding treatments that have already been processed and to certify how the results were obtained.
BIRDS reduces the complexity of workflow design: workflows follow from the configured dataflow dependencies between treatments, and data availability triggers the automatic building and execution of jobs.
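The bookkeeping of consumed resources can be pictured with a minimal Java sketch. This is an illustration under stated assumptions, not BIRDS code: the class `ConsumedResources` and the method `markConsumed` are hypothetical names, an in-memory set stands in for the BIRDS database, and resources are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; an in-memory set stands in for the BIRDS database.
class ConsumedResources {
    // Each entry is the treatment name followed by the resources it consumed.
    private final Set<List<String>> consumed = new HashSet<>();

    // Records a resource set for a treatment; returns true only the first time it is seen.
    boolean markConsumed(String treatment, List<String> resourceSet) {
        List<String> key = new ArrayList<>();
        key.add(treatment);
        key.addAll(resourceSet);
        return consumed.add(key);
    }

    public static void main(String[] args) {
        ConsumedResources registry = new ConsumedResources();
        List<String> resourceSet = List.of("seq1", "bankA");
        System.out.println(registry.markConsumed("Blast", resourceSet)); // true: a job can be created
        System.out.println(registry.markConsumed("Blast", resourceSet)); // false: already processed
    }
}
```

A job would be created only when `markConsumed` returns true, so the same treatment is never rebuilt for the same resources.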
Business rules are written by the user with the Drools technology. They are written outside BIRDS and interpreted by BIRDS at runtime, so each user can define their own rules for a specific context. Business rules are thus defined separately, and all automation functionality is delegated to BIRDS. The rules act at several stages: resource selection strategy, command line building, pre- and post-job execution, and error handling (for example, sending an alert or relaunching the job).
A treatment defines what the user wants to run in BIRDS. It is defined once, in a configuration file that holds the settings of the program to automate: a description of the inputs, outputs and parameters of an executable program. Parameters can also be defined at runtime in rules, depending on the specific context (e.g. the input resources).
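As a rough illustration of how configured and runtime parameters might come together, the Java sketch below assembles a command line. It is not the BIRDS implementation: the class `CommandLineSketch`, the method `build`, and the chosen syntax (executable, then key/value parameters, then input resources) are assumptions made for the example, as is the rule that runtime parameters override configured ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the BIRDS API.
class CommandLineSketch {

    // Builds "executable key value ... input ..."; runtime parameters override configured ones.
    static List<String> build(String executable, Map<String, String> configured,
                              Map<String, String> runtime, List<String> inputs) {
        Map<String, String> parameters = new LinkedHashMap<>(configured);
        parameters.putAll(runtime);
        List<String> command = new ArrayList<>();
        command.add(executable);
        for (Map.Entry<String, String> parameter : parameters.entrySet()) {
            command.add(parameter.getKey());
            command.add(parameter.getValue());
        }
        command.addAll(inputs);
        return command;
    }

    public static void main(String[] args) {
        Map<String, String> configured = new LinkedHashMap<>();
        configured.put("-p", "2"); // set once in the configuration file
        Map<String, String> runtime = new LinkedHashMap<>();
        runtime.put("-q", "3");    // computed at runtime, e.g. by a rule
        List<String> command = build("/env/cns/proj/Tara/blast.sh", configured, runtime,
                List.of("seq1", "bankA"));
        System.out.println(String.join(" ", command));
    }
}
```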
A resource is the data needed to run a program. Once a resource is available and treatments are defined, BIRDS automatically creates the jobs to run.
A BIRDS resource is characterized by a type and a set of key-value pairs. The minimal declaration of a resource type is its name. The definition of a new resource type should be considered carefully because it affects the creation and execution of new jobs.
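The notion of a typed resource carrying key-value pairs can be sketched in Java as follows. The classes `ResourceType` and `Resource` and the method `accepts` are hypothetical names chosen for this example, and the check on required properties is an assumption about what a required property means, not a description of BIRDS internals.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; class and method names are not part of the BIRDS API.
class ResourceDemo {

    static class ResourceType {
        final String name;
        final Set<String> requiredProperties;

        ResourceType(String name, Set<String> requiredProperties) {
            this.name = name;
            this.requiredProperties = requiredProperties;
        }

        // A resource matches if it carries this type name and every required key.
        boolean accepts(Resource resource) {
            return name.equals(resource.typeName)
                    && resource.properties.keySet().containsAll(requiredProperties);
        }
    }

    static class Resource {
        final String typeName;
        final Map<String, String> properties;

        Resource(String typeName, Map<String, String> properties) {
            this.typeName = typeName;
            this.properties = properties;
        }
    }

    public static void main(String[] args) {
        ResourceType seq = new ResourceType("SEQ", Set.of("name"));
        Resource complete = new Resource("SEQ", Map.of("name", "seq1", "path", "/data/seq1.fa"));
        Resource incomplete = new Resource("SEQ", Map.of("path", "/data/seq2.fa"));
        System.out.println(seq.accepts(complete));   // true
        System.out.println(seq.accepts(incomplete)); // false: the "name" property is missing
    }
}
```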
A BIRDS resource must be connected to a resource referential. A resource referential is the data store (database, remote server, flat file, ...) that hosts resources. For each resource in a referential, a BIRDS resource is created; it contains the key/value pairs that, according to its type, are required to execute jobs using this resource. A resource referential can host resources of several types, and can be internal or external depending on whether its resources are stored in the internal BIRDS database or in an external referential. Each referential must be known to BIRDS and is therefore declared once in a configuration file.
A BIRDS job is created with all the information necessary for the execution of a treatment, usually represented as a command line. Jobs are built by combining the resources available on each input (Cartesian product).
For example, consider the treatment specification of a Blast program that compares a sequence against a public sequence bank (cf fig 2). This treatment specification defines two inputs and one output, which holds the results of the alignment between the sequences. The first input accepts resources declared as type 'SEQ' for sequences, and the second input accepts resources declared as type 'Bank' for public sequences. Suppose the first input is connected to a resource referential that provides two sequences of type 'SEQ' at a given moment, and the second input is connected to a resource referential, linked to the bank, that provides three public sequences of type 'Bank'. In this example there are therefore 2 x 3 = 6 possible resource sets, a resource set being a combination of BIRDS resources obtained by crossing the resources retrieved for each input. The treatment specification thus generates six jobs to run.
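The crossing of inputs in this example can be sketched in Java. This is only an illustration of the Cartesian product, not BIRDS code: the class `ResourceCrossing` and the method `crossInputs` are hypothetical names, and resources are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the BIRDS API.
class ResourceCrossing {

    // Returns every combination made of one resource per input (Cartesian product).
    static List<List<String>> crossInputs(List<List<String>> resourcesPerInput) {
        List<List<String>> resourceSets = new ArrayList<>();
        resourceSets.add(new ArrayList<>()); // start from a single empty combination
        for (List<String> inputResources : resourcesPerInput) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> partial : resourceSets) {
                for (String resource : inputResources) {
                    List<String> next = new ArrayList<>(partial);
                    next.add(resource);
                    extended.add(next);
                }
            }
            resourceSets = extended;
        }
        return resourceSets;
    }

    public static void main(String[] args) {
        List<String> sequences = List.of("seq1", "seq2");        // resources of type SEQ
        List<String> banks = List.of("bankA", "bankB", "bankC"); // resources of type Bank
        List<List<String>> resourceSets = crossInputs(List.of(sequences, banks));
        System.out.println(resourceSets.size() + " resource sets"); // 6 resource sets
        for (List<String> resourceSet : resourceSets) {
            System.out.println(resourceSet);
        }
    }
}
```

If one input has no available resource, the product is empty and no resource set is produced, which matches the idea that jobs are only generated when every input has resources available.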
BIRDS jobs are built from a set of parameters defined by users in XML configuration files. The BIRDS API offers services to add this configuration to the BIRDS database so that the BIRDS process takes it into account. Two types of configuration are required:
<Declaration xmlns="http://www.genoscope.cns.fr/specification"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.genoscope.cns.fr/specification">
  <!-- Resource types -->
  <ResourceType name="SEQ">
    <RequiredProperty propertyName="name" />
  </ResourceType>
  <ResourceType name="Bank"/>
  <ResourceType name="BLAST_OUTPUT"/>
  <!-- Resource referentials and the resource types they host -->
  <Referential name="CABRI" device="CABRI_DEVICE">
    <ResourceType name="Bank" />
  </Referential>
  <Referential name="LIMS" device="LIMS_DEVICE" readOnly="false">
    <ResourceType name="SEQ" />
  </Referential>
  <Referential name="BLAST_RESULTS" device="BIRDS_DEVICE" readOnly="false">
    <ResourceType name="BLAST_OUTPUT" />
  </Referential>
  <!-- Devices used to access the referentials -->
  <ReferentialDevice name="CABRI_DEVICE" device="fr.genoscope.lis.devsi.birds_extension.device.CabriDevice" />
  <ReferentialDevice name="LIMS_DEVICE" device="fr.genoscope.lis.devsi.birds.device.DatabaseDevice">
    <Connection url="jdbc:jtds:sybase://host:port/database" passwd="****" login="user" driver="com.sybase.jdbc3.jdbc.SybDriver"/>
  </ReferentialDevice>
  <ReferentialDevice name="BIRDS_DEVICE" device="fr.genoscope.lis.devsi.birds.device.InternalReferentialDevice" />
  <Project name="Tara" workspace="%PROJECT_WORKSPACE"/>
</Declaration>
<Treatment name="Blast" checkIfNewTreatmentAlreadyPerformed="true" useJobScheduler="true">
  <!-- Program to run and scheduler settings -->
  <ExecutableSpecification>
    <Executable path="/env/cns/proj/Tara/blast.sh" version="1.0" commandSyntaxStrategyName="default" user="userToConnect" host="hostname" />
    <Slurm job="blast" account="blast_ccrt" part="normal" time="5:0:0" nb.tasks="1" nb.cpu="4" mem.cpu="48" nb.tasks.core="1" groupOutput="true"/>
  </ExecutableSpecification>
  <ParametersSpecification>
    <KeyValueElement key="-p" value="2" />
    <LineValueElement parameterName="param1" line="-p 2 -q 3" />
  </ParametersSpecification>
  <!-- Inputs, each connected to a resource referential -->
  <InputsSpecification>
    <InputElement name="bank_blast_input" count="1" resourceType="Bank">
      <Referential name="CABRI" />
    </InputElement>
    <InputElement name="blast_input" count="1" resourceType="SEQ">
      <Referential name="LIMS_PROD" />
    </InputElement>
  </InputsSpecification>
  <!-- Outputs and the referential that stores them -->
  <OutputsSpecification>
    <OutputElement name="blast_output" resourceType="BLAST_OUTPUT">
      <Referential name="LIMS_PROD" />
    </OutputElement>
  </OutputsSpecification>
</Treatment>
The BIRDS client is a Java application: a process started by the administrator.
This process automates the generation and execution of jobs based on the treatment configuration.
The BIRDS client consists of two parts (cf fig 3):
The business logic is defined by the user outside the BIRDS API, as rules for the Drools rules engine. BIRDS processes interpret these rules at runtime, at several stages, which lets the user define a specific context according to the current resources or job (cf fig 4). For example, the user can define rules that filter resources at the "resource selection" stage or that calculate a parameter at the "building command line" stage.
Example of a resource selection rule:
rule "selection lotSeq from database device LIMS"
@BirdsRule( selectionRule )
dialect 'java'
salience 300
when
    // Input "bank_blast_input" of the "Blast" treatment
    $input : InputSpecificationElement( name == "bank_blast_input", treatmentSpecification.name == "Blast" )
    // Referential "CABRI" and its database device
    $resourcesReferential : ResourcesReferential( name == "CABRI" )
    $device : DatabaseDevice() from $resourcesReferential.referentialDevice
    // Resource set for this input and referential that is not initialized yet
    $rps : ResourcePropertiesSet( initialized == false, inputSpecificationElement == $input, resourcesReferential == $resourcesReferential )
then
    $rps.initialize();
    // Load the candidate resources from the referential
    Set<ResourceProperties> resourcesPropertiesSet = $device.getPropertiesExecuteQuery("SELECT * from cabri_table");
    $rps.addResourcePropertiesSet(resourcesPropertiesSet);
    modify($rps){};
end
BIRDS processes handle the errors they encounter by sending error signals to the main workflow (the resource management process). These signals act as breakpoints in the workflow, and they can also be intercepted by user rules. Users can thus define an action to take when an error occurs, for example sending an alert by email or relaunching the job.
Other features are present in the BIRDS API and will be described in the future:
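The two reactions mentioned above, relaunching and alerting, can be combined in a simple policy, sketched here in plain Java. In BIRDS such behaviour would be expressed in user rules; the class `RelaunchPolicy`, the method `runWithRelaunch` and the attempt limit are hypothetical and only illustrate the idea.

```java
import java.util.function.Consumer;
import java.util.function.IntPredicate;

// Hypothetical illustration; in BIRDS this logic would live in user-defined rules.
class RelaunchPolicy {

    // Runs the job up to maxAttempts times; sends one alert if every attempt fails.
    static boolean runWithRelaunch(IntPredicate job, int maxAttempts, Consumer<String> alert) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (job.test(attempt)) {
                return true; // the job succeeded, no alert needed
            }
        }
        alert.accept("job failed after " + maxAttempts + " attempts");
        return false;
    }

    public static void main(String[] args) {
        // Simulated job that fails on the first attempt and succeeds on the second.
        boolean succeeded = runWithRelaunch(attempt -> attempt >= 2, 3, System.err::println);
        System.out.println(succeeded); // true
    }
}
```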
A group treatment specification is a specification that brings together several treatment specifications. Like a treatment specification, a group treatment specification accepts input and output resources.
It gives the overall progression of a resource through each treatment. A group treatment is finished once its internal treatments are completed.
It allows internal jobs to be removed from the history once the group execution is finished.
It allows error recovery at a specific step.
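Since these features are not yet described in detail, the following Java sketch only illustrates one possible reading of a group treatment: the group is finished once all internal treatments are done, and recovery restarts at the first treatment that is not done. The class `TreatmentGroup`, its methods and the step name "Filter" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a group treatment; not part of the BIRDS API.
class TreatmentGroup {
    enum Status { PENDING, DONE, FAILED }

    // Internal treatments in execution order, with their current status.
    private final Map<String, Status> steps = new LinkedHashMap<>();

    void addStep(String treatment) {
        steps.put(treatment, Status.PENDING);
    }

    void setStatus(String treatment, Status status) {
        steps.put(treatment, status);
    }

    // The group is finished once every internal treatment is done.
    boolean isFinished() {
        return steps.values().stream().allMatch(status -> status == Status.DONE);
    }

    // First internal treatment that is not done, or null if the group is finished.
    String resumePoint() {
        for (Map.Entry<String, Status> step : steps.entrySet()) {
            if (step.getValue() != Status.DONE) {
                return step.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TreatmentGroup group = new TreatmentGroup();
        group.addStep("Blast");
        group.addStep("Filter");
        group.setStatus("Blast", Status.DONE);
        group.setStatus("Filter", Status.FAILED);
        System.out.println(group.isFinished());  // false
        System.out.println(group.resumePoint()); // Filter
    }
}
```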