Quest to a perfect (NGS) production pipeline

Mateusz Kuzak (eScienceCenter/UvA)
Wibowo Arindrarto (LUMC)
Peter van 't Hof (LUMC)
Leon Mei (LUMC)
…

Agenda

NGS groups & Infrastructure
Cluster: LUMC (SGE), UMCG (PBS), WUR (SLURM), UMCU (SGE), KeyGene (?)
Server: ErasmusMC, AMC, VUMC, UMCN, UMCM
Cloud, Grid, Clusters at SURFsara

Pipeline
● set of subsequent analysis steps
● output of one step is input for next one
● input – raw data
● output – alignments, count tables, visualizations

Perfect pipeline
● Sustainable: good support and reliable community
● Robust: can rerun part of the pipeline
● Scalable: utilize multiple cores (in a cluster)
● Modular and no boilerplate code: swap in/out similar components (e.g. switching aligners), modules can be written in different languages
● Portable: can easily run on a different site
● Transparent control: directly manage script, file location and change parameters
● User friendliness: defining jobs and setting parameters should be done via an easy-to-read file format (e.g. YAML)
● Provenance: explicit tracking of all scripts and options used, executed steps for report generation or monitoring using a webpage
...
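The "Pipeline", "Robust" and "Provenance" points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the tools in this talk work: `run_pipeline`, `rerun_from`, `toy_align` and `toy_count` are hypothetical names, and the two toy steps stand in for a real aligner and a real read counter.

```python
import os
from collections import Counter


def toy_align(in_path, out_path):
    # Hypothetical stand-in for an aligner: just sorts the input lines.
    with open(in_path) as src:
        reads = sorted(line.strip() for line in src if line.strip())
    with open(out_path, "w") as dst:
        dst.write("".join(read + "\n" for read in reads))


def toy_count(in_path, out_path):
    # Hypothetical stand-in for a counting step: writes a small count table.
    with open(in_path) as src:
        counts = Counter(line.strip() for line in src if line.strip())
    with open(out_path, "w") as dst:
        for read in sorted(counts):
            dst.write(f"{read}\t{counts[read]}\n")


def run_pipeline(steps, raw_input, workdir, rerun_from=None):
    """Run steps in order; the output of one step is the input of the next.

    steps is a list of (name, function) pairs, each function taking
    (in_path, out_path). A step is skipped when its output already exists,
    unless it is named in rerun_from or an upstream step has just run.
    Returns a provenance log with one record per step.
    """
    provenance = []
    current = raw_input
    must_run = False
    for name, func in steps:
        out_path = os.path.join(workdir, name + ".out")
        # Once one step runs, everything downstream of it runs too.
        must_run = must_run or name == rerun_from or not os.path.exists(out_path)
        if must_run:
            func(current, out_path)
        provenance.append({
            "step": name,
            "input": current,
            "output": out_path,
            "status": "ran" if must_run else "skipped",
        })
        current = out_path
    return provenance
```

Calling `run_pipeline([("align", toy_align), ("count", toy_count)], "reads.txt", "work")` twice runs both steps the first time and skips both the second time; passing `rerun_from="count"` redoes only the counting step. The returned list of records is the kind of information a report or monitoring page could be built from.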
Options
● MOA: command-line workflows for bioinformatics
● Ruffus: light-weight Python Computational Pipeline Management
● GNU make: standard unix build tool
● Snakemake: Python-based language (DSL)
● Bpipe: Java and Groovy-based tool
● Bcbio-nextgen: community based (Blue Collar Bioinformatics)
● Molgenis-compute: local expertise
● GATK queue: Scala-based pipeline
● Galaxy: GUI, active community
● Further possibilities: new initiatives from open source community, scientific workflow projects

GATK queue #1
● java -jar Queue.jar -S <script>.scala
  java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
● MIT license, made in Broad, roadmap is unclear
● Use DRMAA (native support to LSF, Grid Engine, batches available for PBS, condor, etc)
● can visualize pipeline into a dot graph

GATK queue #2

GATK queue #3

Agenda