If You Want to Take Your Bioinformatics Analysis to the Cloud

A paper published in Biology Direct by Lin Dai, Xin Gao, Yan Guo, Jingfa Xiao and Zhang Zhang provides a good list of programs that are currently available. (h/t: @sgivan)

Data as a Service (DaaS)

AWS Public Datasets: Cloud-based archives of GenBank, Ensembl, 1000 Genomes, Model Organism Encyclopedia of DNA Elements, Unigene, Influenza Virus, etc.

-----

Software as a Service (SaaS)

BGI Cloud Cloud-based implementations of various genomic analysis applications

CloudAligner Fast and full-featured MapReduce-based tool for sequence mapping

CloudBLAST A cloud-based implementation of NCBI BLAST

CloudBurst Highly sensitive short read mapping with MapReduce

Contrail Cloud-based de novo assembly of large genomes

Crossbow Read mapping and SNP calling using cloud computing

EasyGenomics Cloud-based NGS pipelines for whole genome resequencing, exome resequencing, RNA-Seq, small RNA and de novo assembly

eCEO Cloud-based identification of large-scale epistatic interactions in genome-wide association study (GWAS)

FX RNA-Seq analysis tool

Gaea Cloud-based genome re-sequencing assembly

Hecate (unpublished) Cloud-based de novo assembly

Jnomics (unpublished) Cloud-scale sequence analysis suite based on Apache Hadoop

Myrna Differential gene expression tool for RNA-Seq

PeakRanger Cloud-enabled peak caller for ChIP-seq data

RSD: Reciprocal smallest distance algorithm for ortholog detection using Amazon’s Elastic Computing Cloud

VAT Variant annotation tool to functionally annotate variants from multiple personal genomes at the transcript level

YunBe Pathway-based or gene set analysis of expression data

-----

Platform as a Service (PaaS)

Eoulsan Cloud-based platform for high throughput sequencing analyses

Galaxy Cloud Cloud-scale Galaxy for large-scale data analysis

-----

Infrastructure as a Service (IaaS)

Cloud BioLinux: A publicly accessible virtual machine for high performance bioinformatics computing using cloud platforms

CloVR: A portable virtual machine for automated sequence analysis using cloud computing

One interesting aspect of Biology Direct is that it allows readers to see reviewers’ comments. The following exchange between Dr. Igor Zhulin (University of Tennessee, United States of America) and the authors sheds light on HPC versus cloud issues.

Reviewer 2: The review summarizes advantages of using cloud computing for big data storage and analysis issues in bioinformatics. In general, it does a fair job on this front. However, disadvantages of clouds are not discussed in this review at all. For example, time-critical calculations, complex tasks that require data management (load balancing, fault tolerance issues, etc.) will not do well on clouds that lack the edge of advanced HPC architectures.

Authors’ response: Thanks for your valuable comments. We accepted your comments and added some description in the main text. Hadoop (http://hadoop.apache.org) features two key modules: MapReduce and Hadoop Distributed File System (HDFS). MapReduce divides a computational program into many small sub-problems and distributes them on multiple computer nodes, and HDFS provides a distributed file system that stores data on these nodes. Hadoop and its associated software are designed to handle load balancing among multiple nodes and to detect node failures that can be automatically re-executed on any node. Therefore, Hadoop is capable of performing time-critical calculations by distributing tasks and large datasets over multiple computer nodes, supporting big data scaling, and enabling fault-tolerant parallelized analysis.
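For readers who have not seen the MapReduce pattern the authors describe, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop code and not taken from the paper: the k-mer counting task, the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the toy reads are all assumptions chosen for illustration. In a real Hadoop job, the map and reduce calls would run on different nodes and the shuffle would move data between them.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence (hypothetical toy driver)."""
    # Each read is an independent sub-problem, so on a cluster the
    # map calls could be distributed across nodes.
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, vals) for key, vals in grouped.items())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGT"]  # made-up reads, for illustration only
    print(count_kmers(toy_reads))
```

Because every map call looks at one read only and every reduce call looks at one key only, a failed task can simply be rerun on another node without touching the rest of the job, which is roughly the fault-tolerance property the authors refer to.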

-----

Edit.

In the context of doing bioinformatics in the cloud, it is worth mentioning that our longtime reader Mikael Huss maintains a highly informative blog (Follow the Data). Mikael suggests looking into the Hadoop-based program Seal for tasks similar to those Jnomics handles, namely cloud-scale sequence analysis.
