Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

Varying K-mer Sizes to Compensate for Above Deficiency

We often run Velvet or other de Bruijn assemblers with varying K-mer sizes (21, 23, 25, 27, etc.) to find the best assembly. Why does the method work? Let us present an intuitive explanation.

image

If someone gives you the above de Bruijn graph, would you be able to reproduce the underlying sequence? The answer is yes, because each 7-mer has unique neighbors, from which the next nucleotide can be determined. If all de Bruijn graphs were like the above figure, genome assembly would have been an easy problem. In reality, we encounter de Bruijn graphs with many criss-crosses, originating primarily from the repetitive regions of a genome.

Here is a simple example of two overlapping branches of a de Bruijn graph.

image

The graph can be traversed through four paths (shown below in four colors), but are they all real? When we inspect the underlying sequence, we find that only the red and green branches have supporting evidence. The above picture is a good example of a scenario, where information is lost after constructing de Brujin graph. The original sequences were likely from non-repetitive regions of a genome, but they happened to have a common k-mer. That common k-mer made two branches overlap, giving the assembler four paths to resolve instead of two. You can see that two paths would separate by simply going from 7-mer to 9-mer. Increasing k-mer size resolves many spurious ambiguities, thus making the task of assembler easier.


Web Statistics