Tutorials

Enjoy This Site? Join Our Remote R/Bioinformatics Classes

Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

Summary

The above discussion introduces de Bruijn graphs for our purpose. Why they are so popular for genome assembly programs will be explained in the following section. Before closing, we note few points about de Bruijn graphs that will be helpful in our future discussions.

Given any sequence and k-mer size, we can create a de Bruijn graph in an unique manner.
The other direction is not true. All de Bruijn graphs cannot be resolved into unique sequences. Unless the de Bruijn graph is in its simplest form, it usually resolves into many possible sequences.
Larger the k-mers, more easy it is to convert the de Bruijn graph into an unique sequence.
Higher k-mer size generally requires more computer memory to store and process the graph. Therefore, how much RAM a machine has sets the upper limit for value of k, unless the data structures are properly designed. With a machine with unlimited RAM and appropriate data structure, k-mer size is limited by short read size (discussed in the following section).
The nodes of de Bruijn graph are different from ordinary graphs, because each node can be guessed from its adjacent node by going through the dictionary of symbols. For example, if we know the 7-mer of de Bruijn graph to be ATGGAAG, its adjacent node can be only TGGAAGA, TGGAAGC, TGGAAGG and TGGAAGT out of 4^7 possibilities. Similarly, if we know two nodes in a linear region to be ATGGAAG and GGAAGTC, we can deduce the middle node as TGGAAGT. In general, extensive amount of redundancies exist in the graph structure and can be exploited to store the graph in less memory.