Genome assembly remains a highly active field, and researchers keep producing algorithms that deliver significant speed improvements. The following three projects are worth your attention.
1. Peregrine by Jason Chin
Soon it will take less time to assemble the human genome de novo than to talk about how to assemble it. Jason Chin recently posted on Twitter:
If you are not in #SFAF2019, here is my slide deck for a new genome assembly approach implemented in the Peregrine assembler: https://speakerdeck.com/jchin/assembling-human-genome-in-100-minutes … Exciting to talk about it in 20 minutes..
You can access Peregrine here.
2. Wtdbg2 by Jue Ruan and Heng Li
Jue Ruan and Heng Li recently posted a preprint on bioRxiv titled “Fast and accurate long-read assembly with wtdbg2”.
Existing long-read assemblers require tens of thousands of CPU hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a novel long-read assembler wtdbg2 that, for human data, is tens of times faster than published tools while achieving comparable contiguity and accuracy. It represents a significant algorithmic advance and paves the way for population-scale long-read assembly in future.
You can access wtdbg2 here. Those working on nanopore data should pay attention to this listed limitation: “For Nanopore data, wtdbg2 may produce an assembly smaller than the true genome.” It is unclear whether the same holds for the other two assemblers.
3. Flye by Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner
This preprint, titled “Assembly of Long Error-Prone Reads Using Repeat Graphs”, was posted in January 2018. It has since been published in Nature Biotechnology.
The problem of genome assembly is ultimately linked to the problem of the characterization of all repeat families in a genome as a repeat graph. The key reason the de Bruijn graph emerged as a popular short read assembly approach is because it offered an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm for constructing the A-Bruijn (assembly) graph from long error-prone reads, that, in contrast to the k-mer-based de Bruijn graph, assembles genomes using an alignment-based A-Bruijn graph. In difference from existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and further constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between various repeat instances that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art Single Molecule Sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.
You can access Flye here.
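To make the abstract's remark about de Bruijn graphs and repeats concrete, here is a minimal Python sketch of the classic exact k-mer de Bruijn graph on a toy sequence. It is not Flye's alignment-based A-Bruijn or repeat graph, and it ignores sequencing errors entirely. The function names and the toy sequence are made up for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one successor, i.e. where a path forks."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy "genome" (hypothetical): the repeat GATTA occurs twice,
# followed by C the first time and G the second time.
genome = "ACGATTACCTGATTAGG"
graph = de_bruijn_graph([genome], k=4)

# Both copies of the repeat collapse onto the same nodes, so the graph
# forks at the end of the repeat instead of storing it twice.
print(branching_nodes(graph))
```

The two copies of the repeat share a single path through the graph, and the fork at its end is what an assembler has to resolve. With error-prone long reads, exact k-mers rarely match, which is part of the motivation the abstract gives for an alignment-based graph.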
The source code of all three programs is available online, but should you try to understand it? Here is a warning (h/t: @infoecho, @rayanchikhi):
“A good genome assembler is like a good sausage, you’d rather not know how it was made” - S Gnerre, ALLPATHS assembler