Workflows in bioinformatics

posted Jan 23, 2015, 2:40 PM by Klc Kelsey   [ updated Jan 23, 2015, 2:51 PM by Thomas Sierocinski ]

I was following the chatter on Twitter regarding E-InfraMPS2015.

"The E-InfraMPS2015 workshop brought together scientists sharing their experiences on how to build efficient and sustainable e-infrastructures for massively parallel sequencing data management and storage, as well as setting up and maintaining an associated ecosystem of workflows, pipelines, and bioinformatics software."

There was much information to digest at E-InfraMPS2015. In this post I'll focus on summarizing the ecosystem of workflows in bioinformatics.


Chipster provides easy access to more than 340 analysis tools; no programming or command-line experience required. Free and open source.
Visualize in the GUI, share analysis sessions, integrate data. Runs as a Java Web Start application. #einframps2015
Chipster is easy to install: the VM contains all the analysis tools and reference data, and it is easy to add new tools.


Bpipe: a "pipeline interpreter and runner", written in Java and Groovy.

Bpipe has good audit trails. Runs on all major queuing systems.


Johan Dahlberg on the Piper workflow system built on top of GATK Queue
Queue: parallelizes workflows, is robust (reruns failed jobs), traceable, deletes intermediate files, and offers reusable components.

Queue is "battle-hardened" (it has seen heavy use at e.g. the Broad Institute) and has nifty utilities for NGS.

Piper is a pipeline project started at the SNP&SEQ Technology platform, built on top of GATK Queue. Since then Piper has been adopted by the Swedish National Genomics Infrastructure (NGI) for use in the Swedish Genomes Program, as well as for samples submitted through the Illumina Genome Network to the NGI platform.

Piper builds on the concept of standardized workflows for different next-generation sequencing applications. At the moment Piper supports the following workflows:
WholeGenome: For human whole genome sequencing data. This goes through alignment, alignment quality control, data processing, variant calling, and variant filtration according to the best practice recommended by the Broad Institute, using primarily the GATK.
Exome: For TruSeq and SureSelect human exome sequencing. These use basically the same pipeline as the whole genome workflow, but with the modifications suggested in the best practice document for exome studies.
Haloplex: Haloplex targeted sequencing analysis, including alignment, data processing, and variant calling.
RNACounts: Produces FPKMs for transcripts of an existing reference annotation, using TopHat for mapping and Cufflinks to produce the FPKMs.

All supported workflows are available in the workflows directory in the project root.


Johan Westin from Hitachi Data Systems on Hitachi object-based solutions for genomics.


Samuel Lampa (BILS/Dept of Pharm. Biosci. in Uppsala) on using Luigi, a framework developed by @spotify, for bioinformatics

Luigi supports both command-line and Hadoop execution. It powers thousands of jobs each day at Spotify.

Luigi has a nice visualization interface for task execution, showing the dependency graph.
Luigi is task focused: implement requires() and output() for each task, and this implicitly defines the dependency graph.
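To make the requires()/output() idea concrete, here is a minimal sketch of the pattern in plain Python. It imitates Luigi's task model (class and method names mirror Luigi's API for illustration) but is not actual Luigi code; the file names and tiny runner are assumptions for the example.

```python
# Minimal sketch of Luigi's task model: each task declares its upstream
# dependencies via requires() and its product via output(); a tiny runner
# walks the implied dependency graph and skips work that is already done.
# Plain Python imitating Luigi's API, not Luigi itself.
import os
import tempfile

class Task:
    def requires(self):
        return []                     # upstream tasks this one depends on
    def output(self):
        raise NotImplementedError     # path of the artifact this task produces
    def complete(self):
        return os.path.exists(self.output())
    def run(self):
        raise NotImplementedError

def build(task):
    """Depth-first: run dependencies before the task itself, skip done work."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()

workdir = tempfile.mkdtemp()

class FetchReads(Task):
    def output(self):
        return os.path.join(workdir, "reads.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("ACGT\nTTGA\n")

class CountReads(Task):
    def requires(self):
        return [FetchReads()]
    def output(self):
        return os.path.join(workdir, "count.txt")
    def run(self):
        with open(FetchReads().output()) as f:
            n = sum(1 for _ in f)
        with open(self.output(), "w") as f:
            f.write(str(n))

build(CountReads())  # runs FetchReads first, then CountReads
```

Asking for CountReads pulls FetchReads in automatically, which is what "implicitly defines a dependency graph" means in practice.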

Luca Pireddu (CRS4) on Galaxy as a workflow manager: execute and record operations on data.
Galaxy's web interface is not suitable for automation, so they control it through its REST API, adapting BioBlend to this end.
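As a rough idea of what driving Galaxy over REST looks like, here is a standard-library sketch that builds (but does not send) a request against Galaxy's histories endpoint; BioBlend wraps calls like this in a Python client. The server URL and API key are placeholders, not a real instance.

```python
# Conceptual sketch of scripting Galaxy through its REST API using only the
# standard library. The URL and key below are hypothetical placeholders.
import json
import urllib.request

GALAXY_URL = "https://galaxy.example.org"   # hypothetical server
API_KEY = "0123456789abcdef"                # hypothetical API key

def make_request(endpoint, payload=None):
    """Build (but do not send) a request against a Galaxy API endpoint."""
    url = f"{GALAXY_URL}/api/{endpoint}?key={API_KEY}"
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )

# e.g. create a new history to hold a workflow run's datasets
req = make_request("histories", {"name": "my-analysis"})
```

Sending the request with urllib.request.urlopen(req) would require a live Galaxy server, which is exactly the boilerplate BioBlend hides.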
Created the "Hadoop-Galaxy adapter" to address incompatibilities between Hadoop and Galaxy #einframps2015

Collections of tools for Galaxy that incorporate Hadoop and other Hadoop-based tools (from CRS4?).


Maciej Kandula (Chair of Bioinformatics, Vienna) on Snakemake
Lightweight workflow systems: make is great but lacks some features. Snakemake and Bpipe are similar in many ways.
He recommends Sean Davis' Snakemake tutorial …


Sebastian Schönherr from the Medical University of Innsbruck, Austria, on Hadoop pipelines for NGS analysis
CloudGene executes MapReduce programs and combines them into workflows. It runs in your browser and integrates existing tools.

Nextflow is built around the idea that Linux is the lingua franca of data science.

Stream oriented

Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.
It promotes a programming approach, based on functional composition, that results in resilient and easily reproducible pipelines.

Fast prototyping

Nextflow allows you to write a computational pipeline by making it simpler to put together many different tasks.
You may reuse your existing scripts and tools and you don't need to learn a new language or API to start using it.

Unified parallelism

Nextflow is based on the dataflow programming model which greatly simplifies writing complex distributed pipelines.
Parallelisation is implicitly defined by the processes' input and output declarations. The resulting applications are inherently parallel and can scale up or scale out transparently, without having to be adapted to a specific platform architecture.
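The dataflow idea can be illustrated with a plain-Python sketch (this is not Nextflow's Groovy DSL; the process names are made up for the example): each step consumes values as they become available, and parallelism over independent items falls out of the wiring rather than from explicit thread management.

```python
# Conceptual sketch of dataflow-style implicit parallelism in plain Python
# (not Nextflow syntax): steps are wired output-to-input, and independent
# items are processed concurrently by the executor.
from concurrent.futures import ThreadPoolExecutor

def align(read):         # stand-in for a per-sample alignment step
    return f"aligned({read})"

def call_variants(bam):  # stand-in for a downstream variant-calling step
    return f"variants({bam})"

reads = ["sampleA", "sampleB", "sampleC"]

with ThreadPoolExecutor() as pool:
    # map wires the output stream of one step into the next; the samples
    # are handled in parallel without any explicit thread bookkeeping
    bams = pool.map(align, reads)
    results = list(pool.map(call_variants, bams))
```

In Nextflow the equivalent wiring is declared through process input/output channels, which is why no parallelisation code appears in the pipeline itself.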


Nextflow provides an abstraction layer between your pipeline's logic and the execution layer, so that the pipeline can run on multiple platforms without changes.
It provides out-of-the-box executors for the SGE, LSF, SLURM and PBS/Torque batch schedulers, and for the DNAnexus cloud platform.

Continuous checkpoints

All the intermediate results produced during the pipeline execution are automatically tracked. This allows you to resume execution from the last successfully executed step, no matter what caused it to stop.
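A minimal sketch of this resume behaviour, in plain Python rather than Nextflow internals: each step writes its result to a checkpoint file, so a re-run skips everything already completed. The step names and cache layout are assumptions for the example.

```python
# Minimal sketch of resume-from-last-step behaviour: each step checkpoints
# its result to disk, and a re-run reads completed steps back instead of
# re-executing them. Illustrative only, not how Nextflow stores its cache.
import os
import tempfile

cache = tempfile.mkdtemp()
executed = []  # records which steps actually ran (vs. were resumed)

def step(name, fn):
    marker = os.path.join(cache, name + ".done")
    if os.path.exists(marker):          # checkpoint found: skip the work
        with open(marker) as f:
            return f.read()
    result = fn()
    with open(marker, "w") as f:        # checkpoint the result
        f.write(result)
    executed.append(name)
    return result

def run_pipeline():
    a = step("trim", lambda: "trimmed")
    b = step("align", lambda: a + "+aligned")
    return step("call", lambda: b + "+called")

first = run_pipeline()    # all three steps execute and are checkpointed
second = run_pipeline()   # every result now comes from a checkpoint
```

If the first run had crashed after "align", the second run would redo only "call", which is the behaviour the paragraph above describes.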


Nextflow supports Docker container technology. This, together with the ability to run on multiple platforms, allows you to write self-contained and truly reproducible pipelines.