Succinct de Bruijn Graphs from Tetsuo Shibuya's Group
Reader Lo kindly forwarded us the paper titled Succinct de Bruijn Graphs by Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya and kept us busy over the weekend. When we covered Arapan-S paper from the same group few weeks back, we did not take them too seriously. After all, who wants to solve viral genomes, when we have mammalian genomes and metagenomes to take care of? However, their current paper suggests that they definitely deserve a place in our NGS innovation map.
Search for Low Memory Assembler
If you have been following our commentaries over the last few months, you know that computer scientists are up in arms to solve an important problem regarding NGS genome assembly, namely how to do it with less RAM. Unlike large genome centers, most computer scientists do not have sugar-daddies to buy them 512GB RAM computers. They are compelled to find low-memory alternatives for assembling genomes, so that they can stay in the game without going broke.
To quickly recap, here are few proposed alternatives by various groups.
1. Gossamer from Conway and Bromage: T. C. Conway and Andrew Bromage from NICTA, Australia asked the theoretical question of how little RAM a de Bruijn graph would need, if they wanted to store all node and edge information? They came up with roughly 30.7 GB for human genome (22 bits/edge), and implemented their ideas in Gossamer assembler.
2. SparseAssembler: Ye et al. showed that it was possible to do assembly without storing all nodes, and demonstrated their ideas in their SparseAssembler implementation. Recently released SOAPdenovo2 assembler comes with a pregraph module that likely implements ideas similar to SparseAssembler.
3. Minia from Chikhi and Rizk: Rayan Chikhi of INRIA, France decided to implement Bloom filter data structure and save only the nodes, not edges, to cut memory requirement lot further than theoretical limit of Conway/Bromage.
4. Meraculous (also discussed here): Meraculous algorithm from JGI was similar to Rayan’s Minia regarding its decision to store only the nodes. However, instead of Bloom filter, they stored their nodes in a structure called perfect hash. As Rayan explained in comment section, their memory usage was 21.6 bits/k-mer, or close to Conway/Bromage limit.
5. String Graph Assembler from Simpson and Durbin (also check here): Simpson and Durbin decided to part with de Bruijn graph altogether and implemented string graph assembler to significantly reduce memory usage. String graph assembler is close to overlap-layout-consensus method for assembly, but the biggest contribution of Simpson/Durbin was to figure out a way to use Burrows Wheeler transform to reduce time requirement for finding the overlaps.
Alex Bowe’s Succinct de Bruijn Graphs
The succint de Bruijn graph structure proposed by Bowe et al. reduces memory requirement significantly from the derived theoretical limit of Conway/Bromage, while storing all the edges of de Bruijn graph. From their paper:
For example, the succinct representation of Conway and Bromage [5] uses 40.8GB for storing a de Bruijn graph with m =12,292,819,311 edges and k = 27 (28.5 bits per edge). On the other hand, if we use an efficient implementation of rank/select data structures [17] for our representation, the estimated size is less than 5 bits per edge. Therefore the above graph is stored in less than 8GB.
How do they do that? They figured out a way to perform Burrows Wheeler transform on the nodes/edges of a de Bruijn graph and store the graph in compressed manner, while being able to access the nodes rapidly. In their paper, they presented an example of a de Bruijn graph of three reads in Fig. 1 and explained how it looked in compressed form in Fig. 2. They also provide sample code for their compression algorithm here. The code comprises of only one 500 line long C-program sdg.c that compiles very easily.
Enjoy !!