Why Should Ordinary Bioinformaticians Learn about de Bruijn Graphs?

Researchers working with NGS data typically use software packages such as Velvet, ABySS, Trinity, Oases and SOAPdenovo to assemble the underlying genomes, transcriptomes or metagenomes. Almost all of these tools use de Bruijn graphs to reconstruct genomes from NGS libraries. Although running the packages does not require any knowledge of de Bruijn graphs, that knowledge is necessary to improve the quality of the work, for example by minimizing assembly errors and optimizing memory use. In fact, exhausting the available computer memory (random-access memory, or RAM) is the first obstacle faced by new bioinformaticians working with de Bruijn graph-based assemblers. The most obvious and commonly sought solution, a computer with more RAM, does not scale well with the ever-increasing amount of data. Several bioinformaticians have recently proposed elegant algorithms to address this problem, but understanding and implementing them requires some knowledge of the underlying assembly process.

It is often argued that assembling the genomes of higher eukaryotes is very challenging even for those who fully understand the algorithms. Is learning the inner workings of assemblers then of any benefit to the great majority of other bioinformaticians? The answer is yes, because de Bruijn graph-based algorithms have become important for a new class of problems, namely transcriptome assembly. Transcriptomes did not need to be assembled in the older days of Sanger sequencing, because ESTs were comparable in size to genes, and gene expression was measured separately with microarrays. High-throughput NGS can solve both problems together, but the shorter NGS reads need to be assembled into genes, especially for organisms that lack a sequenced genome. The Trinity and Oases transcriptome assemblers have become popular among researchers working on RNA-seq data.

Over the last eighteen months, we wrote several introductory articles on de Bruijn graph-based assembly programs on our blog (http://homolog.us), and they remain popular among our readers. For their convenience, we wrote these tutorials to explore the same topics in further detail. Here we describe how de Bruijn graph-based assembly programs work, show the impact of sequencing errors on the graphs, explain why the assemblers need so much RAM, and discuss the differences between genome, transcriptome and metagenome assemblers.
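As a small preview of these topics, the following Python sketch builds a toy de Bruijn graph and shows how one sequencing error enlarges it. It is only an illustration under stated assumptions, not the method of any of the assemblers named above: the function names build_de_bruijn and count_edges, the 13-base "genome", the read length of 8 and the choice k = 4 are all made up for this example. Nodes are (k-1)-mers, and every distinct k-mer seen in the reads contributes one edge.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the set of (k-1)-mers it points to.

    Hypothetical helper for illustration: each distinct k-mer in the reads
    becomes one edge from its first k-1 bases to its last k-1 bases.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_edges(graph):
    """Count the edges, which equals the number of distinct k-mers stored."""
    return sum(len(successors) for successors in graph.values())


# Toy data (assumed, not taken from any real organism).
genome = "ATGGCGTGCAATG"
read_length = 8
reads = [genome[i:i + read_length]
         for i in range(len(genome) - read_length + 1)]

clean = build_de_bruijn(reads, k=4)

# The same read as genome[2:10] = "GGCGTGCA", with one substituted base (T -> A).
erroneous_read = "GGCGAGCA"
noisy = build_de_bruijn(reads + [erroneous_read], k=4)

print("edges without error:", count_edges(clean))
print("edges with one error:", count_edges(noisy))
```

In this toy case the single wrong base adds four edges that do not belong to the genome, because every k-mer overlapping the error is new. A real assembler has to hold such spurious k-mers in memory alongside the true ones, which gives a first intuition for why sequencing errors and RAM usage are closely linked.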

