If You Want to Take Your Bioinformatics Analysis to the Cloud
A paper published in Biology Direct by Lin Dai, Xin Gao, Yan Guo, Jingfa Xiao and Zhang Zhang provides a good list of programs that are currently available. (h/t: @sgivan)
Data as a Service (DaaS)
AWS Public Datasets: Cloud-based archives of GenBank, Ensembl, 1000 Genomes, Model Organism Encyclopedia of DNA Elements, Unigene, Influenza Virus, etc.
-----
Software as a Service (SaaS)
BGI Cloud: Cloud-based implementations of various genomic analysis applications
CloudAligner: Fast and full-featured MapReduce-based tool for sequence mapping
CloudBLAST: A cloud-based implementation of NCBI BLAST
CloudBurst: Highly sensitive short read mapping with MapReduce
Contrail: Cloud-based de novo assembly of large genomes
Crossbow: Read mapping and SNP calling using cloud computing
EasyGenomics: Cloud-based NGS pipelines for whole genome resequencing, exome resequencing, RNA-Seq, small RNA and de novo assembly
eCEO: Cloud-based identification of large-scale epistatic interactions in genome-wide association study (GWAS)
FX: RNA-Seq analysis tool
Gaea: Cloud-based genome re-sequencing assembly
Hecate (unpublished): Cloud-based de novo assembly
Jnomics (unpublished): Cloud-scale sequence analysis suite based on Apache Hadoop
Myrna: Differential gene expression tool for RNA-Seq
PeakRanger: Cloud-enabled peak caller for ChIP-seq data
RSD: Reciprocal smallest distance algorithm for ortholog detection using Amazon’s Elastic Computing Cloud
VAT: Variant annotation tool to functionally annotate variants from multiple personal genomes at the transcript level
YunBe: Pathway-based or gene set analysis of expression data
-----
Platform as a Service (PaaS)
Eoulsan: Cloud-based platform for high throughput sequencing analyses
Galaxy Cloud: Cloud-scale Galaxy for large-scale data analysis
-----
Infrastructure as a Service (IaaS)
Cloud BioLinux: A publicly accessible virtual machine for high performance bioinformatics computing using cloud platforms
CloVR: A portable virtual machine for automated sequence analysis using cloud computing
One interesting aspect of Biology Direct is that it allows readers to see reviewers’ comments. The following exchange between Dr. Igor Zhulin (University of Tennessee, United States of America) and the authors elucidates the HPC-versus-cloud issues.
Reviewer 2: The review summarizes advantages of using cloud computing for big data storage and analysis issues in bioinformatics. In general, it does a fair job on this front. However, disadvantages of clouds are not discussed in this review at all. For example, time-critical calculations, complex tasks that require data management (load balancing, fault tolerance issues, etc.) will not do well on clouds that lack the edge of advanced HPC architectures.
Authors’ response: Thanks for your valuable comments. We accepted your comments and added some description in the main text. Hadoop (http://hadoop.apache.org) features two key modules: MapReduce and Hadoop Distributed File System (HDFS). MapReduce divides a computational program into many small sub-problems and distributes them on multiple computer nodes, and HDFS provides a distributed file system that stores data on these nodes. Hadoop and its associated software are designed to handle load balancing among multiple nodes and to detect node failures that can be automatically re-executed on any node. Therefore, Hadoop is capable of performing time-critical calculations by distributing tasks and large datasets over multiple computer nodes, supporting big data scaling, and enabling fault-tolerant parallelized analysis.
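To make the MapReduce idea in the authors' response a little more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all; it only imitates the map, shuffle and reduce phases on a toy problem (counting k-mers in short reads). The function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the example reads are made up for illustration and are not taken from the paper or from any of the tools listed above.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key.

    In a real framework this grouping happens across nodes; here it is
    just an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    # Each read is an independent sub-problem, which is what lets a
    # framework hand reads to different nodes (or re-run one on failure).
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_kmers(["ACGTA", "CGTAC"], k=3))
```

Because each call to `map_kmers` depends only on its own read, the map step can be split across machines and a failed piece can simply be run again elsewhere, which is the property the authors point to when they discuss load balancing and fault tolerance.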
-----
Edit.
In the context of doing bioinformatics in the cloud, it is worth mentioning that our longtime reader Mikael Huss maintains a highly informative blog (Follow the Data). Mikael suggests looking into the Hadoop-based program Seal, which handles tasks similar to those of Jnomics, namely cloud-scale sequence analysis.