Seven Major Trend Changes of 2013 - (ii) Bioinformatics

2. In NGS Bioinformatics, SPAdes Assembler, BCR Algorithm and Diginorm Approach Gained Prominence

a) SPAdes and Scaffolding Problem

Velvet was all the rage in 2012, when we first reported on SPAdes from Russia's Algorithmic Biology lab (check here, here and here). One year later, SPAdes gained wider recognition by performing well in the GAGE-B evaluation.

In this respect, the biggest change in perception was the realization that scaffolding is a non-trivial step responsible for many assembly errors. The authors of the SPAdes assembler spent considerable intellectual firepower on solving the scaffolding problem.


We wrote several commentaries on other excellent assemblers and on assembly-related issues. A small subset is here.

SPAdes and MaSuRCA Assemblers Performed Best in GAGE-B Evaluation

Our First Look at SOAPdenovo2 Source Code

On MaSuRCA Paper and Algorithm

Cleverness of the Ray Assembler

Very Helpful Preprocessing Module for Those Interested in Assembling Genomes

Rayan Chikhi's KmerGenie Slides from HitSeq 2013

GAM-NGS and REAPR Papers are Published


b) Alternate Approaches for Processing Large Libraries - Bauer-Cox-Rosone Algorithm/BEETL/Ropebwt and Kmer Counting/Diginorm/Sailfish

Heng Li brought our attention to a series of excellent papers written by Anthony Cox, Giovanna Rosone and co-authors in 2012. In “Heng Li Releases Ropebwt2”, we wrote -

We discussed the BCR algorithm and related topics on BWT construction from short reads many times (here - check the comment section, here, here and here). Readers may find the following implementation useful (h/t: @rayanchikhi). Version 1 of this code is an important component of the SGA assembler.

The goal of their work was to improve the processing of large NGS files by converting them into an FM-index. When we checked "the problems being solved by the top NGS bioinformaticians today?" in Nov 2013, we found that most had a project related to implementing a similar algorithm. That definitely qualifies as a change in perception, and it is driven by the large sizes of HiSeq libraries and the time it takes to compress/process/ftp them.

Titus Brown and other researchers followed a different approach to the same problem of large libraries. They converted each read into k-mer units and used the k-mer collection as a substitute for the reads to expedite processing. Digital normalization used k-mers to prune the read collection, Sailfish used them to rapidly compute expression from RNAseq data without alignment, and kSNP used k-mers to find SNPs without alignment. For k-mer counting, programs like Jellyfish are popular, but readers may take a look at the following low-memory methods from Rayan Chikhi and Titus Brown.

DSK: K-mer Counting with Very Low Memory Usage

Efficient Online k-mer Counting using a Probabilistic Data Structure
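The diginorm idea described above can be sketched in a few lines of Python. This is only an illustration, not the khmer implementation: the function names and the tiny k/cutoff values are ours, and the real tool counts k-mers in a probabilistic count-min sketch rather than an exact hash table.

```python
from collections import defaultdict

def kmers(read, k):
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def median(values):
    s = sorted(values)
    return s[len(s) // 2]

def digital_normalization(reads, k=4, cutoff=3):
    """Keep a read only if the median abundance of its k-mers,
    counted over the reads kept so far, is still below the cutoff.
    Redundant high-coverage reads are thereby discarded."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept
```

Running this on ten copies of one read keeps only the first few copies while a read covering a different region survives, which is exactly the coverage-flattening behavior digital normalization aims for.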

What has not been generally recognized is that k-mer counting and building the BWT are equivalent approaches. In fact, one can easily build the Burrows-Wheeler transform of a large library by starting from its k-mer counts.


Speaking of blogs on algorithms, here is an incomplete list of excellent ones -

Alex Bowe’s Blog

Heng Li’s Blog

We will soon add a few others to the above list.


In the following commentary, we will cover -

Seven Major Trend Changes of 2013 (iii) Genomics

Written by M. //