"The E-inframMPS2015 workshop brought together scientists sharing their experiences on how to build efficient and sustainable e-infrastructures for massively parallel sequencing data management and storage, as well as setting up and maintaining an associated ecosystem of workflows, pipelines, and bioinformatics software." There was much information to digest at E-inframMPS2015. In the post i'll be focusing on summarizing the ecosystem of workflows in bioinformatics. ChipsterChipster provides easy access to >340 analysis tools, no programming/command-line experience req'd. Free, open source. Visualize in GUI, share analysis sessions, integrate data. Java Web Start application. #einframps2015 Chipster easy to install. VM contains all the analysis tools and reference data. Easy to add new tools. BpipeBpipe. "pipeline interpreter and runner", written in Java and Groovy. Bpipe has good audit trails. Runs on all major queuing systems. http://www.ncbi.nlm.nih.gov/pubmed/22500002 piperJohan Dahlberg on the Piper workflow system built on top of GATK Queue Queue: parallelizes workflows, robust (reruns failed jobs), traceable, deletes intermediate files, reusable components Queue is "battle-hardened" (has been used a lot by e g Broad Inst.), has nifty utilities for NGS. https://www.broadinstitute.org/gatk/guide/topic?name=queue A pipeline project started at the SNP&SEQ Technology platform built on top of GATK Queue. Since then Piper has been adopted by the Swedish National Genomics Infrastructure (NGI) for use in the the Swedish Genomes Program as well as for samples submitted through the Illumina Genome Network to the NGI platform. Piper builds on the concept of standardized workflows for different next-generation sequencing applications. At the moment Piper supports the following workflows: WholeGenome: For human whole genome sequencing data. 
This goes through alignment, alignment quality control, data processing, variant calling, and variant filtration according to the best practice recommended by the Broad Institute, using primarily the GATK.

Exome: TruSeq and SureSelect human exome sequencing. These use basically the same pipeline as the whole genome pipeline, but with the modifications suggested in the best practice document for exome studies.

Haloplex: Haloplex targeted sequencing analysis, including alignment, data processing, and variant calling.

RNACounts: Produces FPKMs for transcripts of an existing reference annotation, using TopHat for mapping and Cufflinks to produce the FPKMs.

All supported workflows are available in the workflows directory in the project root.

Hitachi

Johan Westin from Hitachi Data Systems on Hitachi object-based storage solutions for genomics.

Luigi

Samuel Lampa (BILS/Dept. of Pharmaceutical Biosciences in Uppsala) on using Luigi, a framework developed by @spotify, for bioinformatics. Luigi supports both command-line and Hadoop execution, and powers thousands of jobs each day at Spotify. Luigi has a nice visualization interface for task execution, showing the dependency graph. Luigi is task focused: you implement requires() and output() for each task, and this implicitly defines a dependency graph.

CRS4 on Galaxy

Luca Pireddu, CRS4, on Galaxy as a workflow manager: execute and record operations on data. Galaxy's web interface is not suitable for automation, so they control it using its REST API, adapting BioBlend to this end. They also created the "Hadoop-Galaxy adapter" to address incompatibilities between Hadoop and Galaxy. #einframps2015 Collections of tools for Galaxy to incorporate Hadoop and other Hadoop-based tools (CRS4?).

Snakemake

Maciej Kandula (Chair of Bioinformatics, Vienna) on Snakemake, a lightweight workflow system. make is great but lacks some features. Snakemake and Bpipe are similar in many ways.
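The make-style idea these tools build on — rerun a step only when its output is missing or older than its inputs — can be sketched in a few lines of plain Python. This is a conceptual sketch of the dependency check, not Snakemake's or Luigi's actual implementation, and the function names are made up:

```python
import os

def needs_update(output_path, input_paths):
    """make-style check: rebuild if the output is missing or older than any input."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

def run_rule(output_path, input_paths, action):
    """Run `action` only when the output is out of date; otherwise skip the step."""
    if needs_update(output_path, input_paths):
        action()
        return True   # step was (re)run
    return False      # step was skipped: output is up to date
```

A workflow engine generalizes this check to a whole dependency graph, scheduling only the stale steps and parallelizing independent ones.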
Recommends Sean Davis' Snakemake tutorial: http://watson.nci.nih.gov/~sdavis/blog/flexible_bioinformatics_pipelines_with_snakemake/

CloudGene

Sebastian Schönherr from the Medical University of Innsbruck in Austria on Hadoop pipelines for NGS analysis. CloudGene executes MapReduce programs and combines them into workflows. It runs in your browser and integrates existing tools.

Nextflow

Nextflow is built around the idea that Linux is the lingua franca of data science.

Stream oriented: Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.

Fast prototyping.
Unified parallelism: Nextflow is based on the dataflow programming model, which greatly simplifies writing complex distributed pipelines.

Portable: Nextflow provides an abstraction layer between your pipeline's logic and the execution layer, so that it can be executed on multiple platforms without changing it.

Continuous checkpoints: All the intermediate results produced during the pipeline execution are automatically tracked. This allows you to resume execution from the last successfully executed step, no matter what the reason was for it stopping.

Reproducibility: Nextflow supports Docker container technology. This, together with the ability to run on multiple platforms, allows you to write self-contained and truly reproducible pipelines.
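The dataflow/Unix-pipes idea — processes connected by channels, each consuming items as soon as the upstream process emits them — can be illustrated with Python generators. This is not Nextflow's Groovy DSL, just a minimal analogue of the model; the process names (fastqc, align) are hypothetical:

```python
def fastqc(reads):
    """Toy 'process': consumes samples from its input channel as they arrive."""
    for sample in reads:
        yield f"{sample}.qc"

def align(qc_passed):
    """Downstream 'process': starts on each item without waiting for the full batch."""
    for sample in qc_passed:
        yield f"{sample}.bam"

# Wiring generators together mimics channels: nothing runs until items flow.
samples = iter(["sampleA", "sampleB"])
pipeline = align(fastqc(samples))
results = list(pipeline)  # ['sampleA.qc.bam', 'sampleB.qc.bam']
```

In a real dataflow engine the same wiring also gives you parallelism for free: each process can run concurrently, synchronized only by the channels between them.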