BIRDS aims to automate the execution of bioinformatics treatments. Because it can manage a large number of bioinformatics treatments in parallel and keeps detailed records of each one, it is particularly suited to a production environment. BIRDS is a flexible API built on the Drools rules engine developed by JBoss, which makes it an expert system driven by business rules. The project can be extended by adding new rules created by users. Because the business logic is separated from the code and expressed as business rules that all users can read, long-term maintenance is easier. The BIRDS API essentially contains the logic for creating and configuring treatments and for executing jobs.

  • fig 1 : BIRDS environment

In BIRDS, it is the availability of new resources that triggers the execution of bioinformatics programs. The design of BIRDS is centered on treatment configuration (XML files). Once a treatment has been configured (i.e. connected to resource referentials) and declared by the administrator, the BIRDS process generates new jobs whenever the resources required by each input of the treatment are available. These jobs are either submitted to a job scheduler or executed locally, depending on the execution context defined in the configuration files. New resources generated by job execution are stored in the resource referentials attached to the outputs, as specified in the configuration. Resources are also recorded in the database as consumed resources, so that treatments already processed are not built again and so that the way the results were obtained can be certified.
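
The role of consumed resources can be illustrated with a minimal Java sketch. It assumes that a processed resource set can be identified by the treatment name plus the ordered identifiers of its resources; the `ConsumedResources` class and its `markIfNew` method are hypothetical names chosen for illustration and do not reflect the BIRDS database schema or API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical registry (not the BIRDS schema): remembers resource sets already processed. */
final class ConsumedResources {

    private final Set<String> consumed = new HashSet<>();

    /** Builds a key from the treatment name and the ordered resource identifiers (assumed format). */
    private static String key(String treatment, List<String> resourceIds) {
        return treatment + "|" + String.join(",", resourceIds);
    }

    /** Records the resource set and returns true only the first time it is seen. */
    boolean markIfNew(String treatment, List<String> resourceIds) {
        return consumed.add(key(treatment, resourceIds));
    }

    public static void main(String[] args) {
        ConsumedResources registry = new ConsumedResources();
        // First sighting: a job should be created.
        System.out.println(registry.markIfNew("Blast", List.of("seq1", "bankA")));
        // Same combination again: already processed, no new job.
        System.out.println(registry.markIfNew("Blast", List.of("seq1", "bankA")));
        // A different combination is new again.
        System.out.println(registry.markIfNew("Blast", List.of("seq2", "bankA")));
    }
}
```

The same record that prevents a second run also documents which inputs produced a given result, which is the traceability aspect mentioned above.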

BIRDS reduces the complexity of workflow design: the workflow follows from the configured dataflow dependencies, and data availability alone triggers the automatic building and execution of jobs.

Business rules are written by the user with the Drools technology. BIRDS interprets these rules at runtime, but they are written outside BIRDS, so each user can define their own rules for a specific context. Users can thus define business rules separately and delegate all automation to BIRDS. The rules act at several stages: resource selection strategy, command line building, before and after job execution, and when an error occurs (for example by sending an alert or relaunching the job).

Treatment

A treatment defines what the user wants to run in BIRDS. It is defined once, in a configuration file that holds the settings of the program to automate: a description of the inputs, outputs and parameters of an executable program. Parameters can also be defined at runtime in rules, depending on a specific context (e.g. the input resources).

Resource

A resource is the data needed to run the program. Once a resource is available and treatments are defined, BIRDS automatically creates the jobs to run.

A BIRDS resource is characterized by a type and a set of key-value pairs. The minimum declaration of a resource type is its type name. A new resource type should be defined with care, because it affects the creation and execution of new jobs.
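
As a rough illustration of this data model, the following Java sketch represents a resource type as a name plus a set of mandatory property keys, and a resource as a type plus key/value pairs. All class and method names (`ResourceModel`, `ResourceType`, `Resource`, `hasRequiredProperties`) are assumptions made for the example and are not taken from the BIRDS API.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical model for illustration only; these classes are not part of the BIRDS API. */
final class ResourceModel {

    /** A resource type: a unique name plus the property keys every resource of that type must carry. */
    static final class ResourceType {
        final String name;
        final Set<String> requiredProperties;

        ResourceType(String name, Set<String> requiredProperties) {
            this.name = name;
            this.requiredProperties = Set.copyOf(requiredProperties);
        }
    }

    /** A resource: a type and a set of key/value pairs. */
    static final class Resource {
        final ResourceType type;
        final Map<String, String> properties;

        Resource(ResourceType type, Map<String, String> properties) {
            this.type = type;
            this.properties = Map.copyOf(properties);
        }

        /** True when every property required by the type is present. */
        boolean hasRequiredProperties() {
            return properties.keySet().containsAll(type.requiredProperties);
        }
    }

    public static void main(String[] args) {
        // A type named "SEQ" that requires a "name" property; paths are made-up sample values.
        ResourceType seq = new ResourceType("SEQ", Set.of("name"));
        Resource complete = new Resource(seq, Map.of("name", "read_001", "path", "/data/read_001.fa"));
        Resource incomplete = new Resource(seq, Map.of("path", "/data/read_002.fa"));
        System.out.println(complete.hasRequiredProperties());   // true
        System.out.println(incomplete.hasRequiredProperties()); // false
    }
}
```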

A BIRDS resource must be connected to a resource referential. A resource referential is the data store (database, remote server, flat file, ...) that hosts resources. For each resource in a referential, a BIRDS resource is created that contains, as key/value pairs, the information that its type requires for executing jobs with this resource. A resource referential can host resources of several types, and it is internal or external depending on whether its resources are stored in the internal BIRDS database or in an external referential. Each referential must be known to BIRDS and is therefore declared once in a configuration file.

Job

A BIRDS job is created with all the information necessary for the execution of a treatment, usually represented as a command line. Jobs are built by combining the resources of each input (Cartesian product).

For example, consider the treatment specification of a Blast program that compares a sequence against a public sequence bank (cf. fig 2). This treatment specification defines two inputs and one output, which holds the results of the alignment between the sequences. The first input accepts resources of type 'SEQ' (sequences) and the second input accepts resources of type 'Bank' (public sequences). Suppose the first input is connected to a resource referential that provides two sequences of type 'SEQ' at a given moment, and the second input to a resource referential connected to the bank that provides three public sequences of type 'Bank'. This gives 2 x 3 = 6 possible resource sets, a resource set being obtained by taking one resource from each input. In this example, the treatment specification therefore generates six jobs to run.

  • fig 2 : Resource combination
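
The combination step of this example can be sketched in Java as a plain Cartesian product over the resources of each input. The `ResourceCombiner` class is a hypothetical helper written for this illustration, with resources reduced to simple string identifiers; it is not the BIRDS implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of the BIRDS API: crosses the resources of each input. */
final class ResourceCombiner {

    /** Returns every combination that takes exactly one resource from each input. */
    static List<List<String>> combine(List<List<String>> resourcesPerInput) {
        List<List<String>> combinations = new ArrayList<>();
        combinations.add(new ArrayList<>()); // start from a single empty combination
        for (List<String> inputResources : resourcesPerInput) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> partial : combinations) {
                for (String resource : inputResources) {
                    List<String> next = new ArrayList<>(partial);
                    next.add(resource);
                    extended.add(next);
                }
            }
            combinations = extended; // an input without resources leaves nothing to run
        }
        return combinations;
    }

    public static void main(String[] args) {
        List<String> seq = List.of("seq1", "seq2");             // two 'SEQ' resources
        List<String> bank = List.of("bankA", "bankB", "bankC"); // three 'Bank' resources
        List<List<String>> resourceSets = combine(List.of(seq, bank));
        System.out.println(resourceSets.size() + " resource sets"); // 6 resource sets
        resourceSets.forEach(System.out::println);
    }
}
```

With this formulation, an input that currently has no available resource yields an empty product, which matches the idea that jobs are generated only when every input can be supplied.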

BIRDS jobs are built from a set of parameters that users define in XML configuration files. The BIRDS API offers services that add a configuration to the BIRDS database so that the BIRDS process takes it into account. Two types of configuration are required:

  • Administrative : defines all the information about projects and resources
  • Treatment specification : describes the input(s), output(s) and parameter(s) of an executable in order to generate the command line. Each input and output is connected to one or more resource referentials and supports a single resource type, which is defined in the administrative configuration.

Administrative configuration

  • Business project : contains a set of treatments. Defined by a unique name and a workspace where standard output and standard error are written. Properties can be added to a project as key/value pairs.
  • Resource type : defined by a unique name. Users can define mandatory properties.
  • Resource referential : defined by a unique name and a connection device. Users can list all the resource types accepted by the referential.
  • Connection device : specifies the implementation used to connect to a resource referential. By default, BIRDS offers several implementations (database, file, JSON, ...), but users can define their own.

<Declaration xmlns="http://www.genoscope.cns.fr/specification" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.genoscope.cns.fr/specification">

<ResourceType name="SEQ">
	<RequiredProperty propertyName="name" />
</ResourceType>

<ResourceType name="Bank"/>

<ResourceType name="BLAST_OUTPUT"/>

<Referential name="CABRI" device="CABRI_DEVICE">
	<ResourceType name="Bank" />
</Referential>

<Referential name="LIMS" device="LIMS_DEVICE" readOnly="false">
	<ResourceType name="SEQ" />
</Referential>

<Referential name="BLAST_RESULTS" device="BIRDS_DEVICE" readOnly="false">
	<ResourceType name="BLAST_OUTPUT" />
</Referential>

<ReferentialDevice name="CABRI_DEVICE" device="fr.genoscope.lis.devsi.birds_extension.device.CabriDevice" />

<ReferentialDevice name="LIMS_DEVICE" device="fr.genoscope.lis.devsi.birds.device.DatabaseDevice">
	<Connection url="jdbc:jtds:sybase://host:port/database" passwd="****" login="user" driver="com.sybase.jdbc3.jdbc.SybDriver"/>
</ReferentialDevice>

<ReferentialDevice name="BIRDS_DEVICE" device="fr.genoscope.lis.devsi.birds.device.InternalReferentialDevice" />

<Project name="Tara" workspace="%PROJECT_WORKSPACE"/>

</Declaration>
					
					

Treatment configuration

  • Treatment : defined by a unique name. Available options: "checkIfNewTreatmentAlreadyPerformed" creates BIRDS jobs only for resource combinations that are new with respect to the consumed resources, "remove history" removes the intermediate history, and "useJobScheduler" enables execution through a job scheduler (execution is local by default).
  • Executable : defined by the executable path and version. The command line syntax can be customized through a strategy chosen by the user with the "commandSyntaxStrategyName" option. Users can choose the execution host name, in which case BIRDS wraps the command line execution in an "SSH" command. If the "useJobScheduler" option is enabled, users must define the parameters required by the chosen job scheduler technology. BIRDS provides LSF and Slurm implementations.
  • Input/Output : defined by a name, one or more resource referentials and the supported resource type.
  • Parameter : can be defined as a key/value pair or as a line value.

<Treatment name="Blast" checkIfNewTreatmentAlreadyPerformed="true" useJobScheduler="true">

	<ExecutableSpecification>
		<Executable path="/env/cns/proj/Tara/blast.sh" version="1.0" commandSyntaxStrategyName="default" user="userToConnect" host="hostname" />
		<Slurm job="blast" account="blast_ccrt" part="normal" time="5:0:0" nb.tasks="1" nb.cpu="4" mem.cpu="48" nb.tasks.core="1" groupOutput="true"/>
	</ExecutableSpecification>

	<ParametersSpecification>
		<KeyValueElement key="-p" value="2" />
		<LineValueElement parameterName="param1" line="-p 2 -q 3" />
	</ParametersSpecification>

	<InputsSpecification>
		<InputElement name="bank_blast_input" count="1" resourceType="Bank">
			<Referential name="CABRI" />
		</InputElement>
		<InputElement name="blast_input" count="1" resourceType="SEQ">
			<Referential name="LIMS_PROD" />
		</InputElement>
	</InputsSpecification>

	<OutputsSpecification>
		<OutputElement name="blast_output" resourceType="BLAST_OUTPUT">
			<Referential name="LIMS_PROD" />
		</OutputElement>
	</OutputsSpecification>
</Treatment>
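
To make the role of the parameters and of the execution host more concrete, here is a minimal Java sketch of how a command line could be assembled and then wrapped in an SSH call. The syntax produced by the "default" strategy is not documented here, so the sketch simply assumes space-separated "key value" pairs followed by the raw line values; the `CommandLineSketch` class and its methods are hypothetical and not part of the BIRDS API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the real BIRDS command syntax strategies may behave differently. */
final class CommandLineSketch {

    /** Joins the executable, the "key value" parameters and the raw line parameters with spaces. */
    static String build(String executable, Map<String, String> keyValues, List<String> lines) {
        List<String> parts = new ArrayList<>();
        parts.add(executable);
        keyValues.forEach((key, value) -> parts.add(key + " " + value));
        parts.addAll(lines);
        return String.join(" ", parts);
    }

    /** Wraps a command so that it runs on a remote host through ssh (single quotes are escaped). */
    static String wrapWithSsh(String user, String host, String command) {
        return "ssh " + user + "@" + host + " '" + command.replace("'", "'\\''") + "'";
    }

    public static void main(String[] args) {
        Map<String, String> keyValues = new LinkedHashMap<>(); // keeps declaration order
        keyValues.put("-p", "2");
        String command = build("/env/cns/proj/Tara/blast.sh", keyValues, List.of("-q 3"));
        System.out.println(command);
        // /env/cns/proj/Tara/blast.sh -p 2 -q 3
        System.out.println(wrapWithSsh("userToConnect", "hostname", command));
        // ssh userToConnect@hostname '/env/cns/proj/Tara/blast.sh -p 2 -q 3'
    }
}
```

In a real treatment the resources selected for each input would also contribute to the command line, typically through rules at the command line building stage; that part is left out of the sketch.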
					
					

The BIRDS client is a Java application, run as a process started by the administrator.

This process automates the generation and execution of jobs based on the treatment configuration.

  • fig 3 : BIRDS client

The BIRDS client consists of two parts (cf. fig 3):

  • Process of resource management : a workflow that runs every 15 min. This process queries all the treatment specifications declared by the administrator and stored in the BIRDS database. For each treatment, it retrieves the available resources (as resource combinations) and passes them to the job management process. The process stops when all specifications have been processed, and the cycle restarts after 15 min.
  • Process of job management : two independent, continuously running processes implemented as Java threads.

    • Process for job creation : for each resource combination made available by the resource management process, this process creates the job to execute and passes it to the job execution process.
    • Process for job execution : for each job made available by the job creation process, this process executes the job.
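
The hand-off between these processes can be pictured as a two-stage producer/consumer pipeline. The Java sketch below is only an illustration under simplifying assumptions: resource sets and jobs are plain strings, the two threads stop after a single scan cycle thanks to a sentinel value (the real threads run continuously), and `ClientPipelineSketch` with its `runOnce` method is a hypothetical name, not a BIRDS class.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of the client pipeline; names and types are not taken from BIRDS. */
final class ClientPipelineSketch {

    private static final String STOP = "__STOP__"; // sentinel that shuts a stage down

    /** Pushes one scan cycle through job creation and job execution; returns the executed jobs. */
    static List<String> runOnce(List<String> resourceSets) {
        BlockingQueue<String> resourceQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> jobQueue = new LinkedBlockingQueue<>();
        List<String> executed = new CopyOnWriteArrayList<>();

        // Job creation thread: turns each resource set into a job.
        Thread creator = new Thread(() -> {
            try {
                for (String rs = resourceQueue.take(); !rs.equals(STOP); rs = resourceQueue.take()) {
                    jobQueue.put("job(" + rs + ")");
                }
                jobQueue.put(STOP);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Job execution thread: a real client would run the command line here.
        Thread executor = new Thread(() -> {
            try {
                for (String job = jobQueue.take(); !job.equals(STOP); job = jobQueue.take()) {
                    executed.add(job);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        creator.start();
        executor.start();

        // Resource management: publish what the scan found, then signal the end of the cycle.
        resourceQueue.addAll(resourceSets);
        resourceQueue.add(STOP);
        try {
            creator.join();
            executor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(List.of("seq1+bankA", "seq2+bankA")));
        // [job(seq1+bankA), job(seq2+bankA)]
    }
}
```

Because the stages communicate only through queues, job execution never blocks the scan for new resources. A periodic cycle such as the 15 min one could be driven by a `ScheduledExecutorService` using `scheduleWithFixedDelay`, which starts the delay only after the previous cycle has finished.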

The business logic is defined by the user outside the BIRDS API, as rules for the Drools rules engine. BIRDS processes interpret these rules at runtime at several stages, which lets the user define a specific context according to the current resources or job (cf. fig 4). For example, a user can define rules that filter resources at the "resource selection" stage or compute a parameter at the "building command line" stage.

  • fig 4 : Process control by rules

Example of a resource selection rule

 
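// Selection rule for the "bank_blast_input" input of the "Blast" treatment,
// reading from the "CABRI" referential through its database device.
rule "selection lotSeq from database device LIMS"
@BirdsRule( selectionRule )
dialect 'java'
salience 300 

	when
	   // the input to fill, the referential to read and the device of that referential
	   $input : InputSpecificationElement( name == "bank_blast_input", treatmentSpecification.name == "Blast")
	   $resourcesReferential : ResourcesReferential(name=="CABRI")
	   $device : DatabaseDevice() from $resourcesReferential.referentialDevice 
	   // the property set of this input/referential pair that has not been initialized yet
	   $rps : ResourcePropertiesSet(initialized==false, inputSpecificationElement==$input, resourcesReferential== $resourcesReferential)
		
	then
	   $rps.initialize();
	   // run the query on the device and add the returned properties to the set
	   Set<ResourceProperties> resourcesPropertiesSet = $device.getPropertiesExecuteQuery("SELECT * from cabri_table");
	   $rps.addResourcePropertiesSet(resourcesPropertiesSet);
	   modify($rps){};
end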
				

BIRDS processes handle the errors they encounter by sending error signals to the main workflow (the resource management process). These signals act as breakpoints in the workflow, but they can also be intercepted by user rules. Users can thus define an action to take when an error occurs, for example sending an alert by email or relaunching the job.
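
In BIRDS these reactions are expressed as Drools rules. Purely to illustrate the "relaunch, then alert" behavior, the following plain Java sketch retries a failing job a bounded number of times and calls an alert callback when it gives up. The `RelaunchSketch` class, its `runWithRelaunch` method and the retry limit are assumptions made for the example, not BIRDS features.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

/** Hypothetical sketch of a relaunch-or-alert policy; BIRDS itself delegates this to user rules. */
final class RelaunchSketch {

    /**
     * Runs the job and relaunches it on failure, up to maxAttempts attempts.
     * Returns the job result, or null after alerting when every attempt failed.
     */
    static <T> T runWithRelaunch(Callable<T> job, int maxAttempts, Consumer<String> alert) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        alert.accept("job failed after " + maxAttempts + " attempts: " + last);
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A job that fails twice before succeeding on the third attempt.
        String result = RelaunchSketch.<String>runWithRelaunch(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "done";
        }, 5, System.err::println);
        System.out.println(result + " after " + calls[0] + " attempts"); // done after 3 attempts
    }
}
```

The alert callback stands in for whatever notification the user chooses, such as sending an email.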

  • fig 5 : Error control by rules

  • Job Generation

    • High speed through continuous scanning of available resources
    • Flexibility through the addition of new rules
    • No dependency tree is needed to define a workflow
    • Resource traceability and job history
  • Job Execution

    • Large-scale analysis through a multi-threaded implementation. Job execution is decoupled from job generation.
    • Process control through business rules that intercept the process at different stages.
    • Job execution in a job scheduler