De Bruijn Graphs for Long Reads

Researchers working with data from multiple sequencing platforms often wonder how de Bruijn graphs take long reads into account. The simple answer is that de Bruijn graphs are agnostic to read length. Graph construction fixes a k-mer size and breaks every read, long or short, into k-mers of that size. Once the graph is built, the long-range information carried by the long reads is lost. That long-range information is most useful for resolving repetitive regions of the genome.
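The point about read-length agnosticism can be illustrated with a minimal sketch. The sequence, read length, and function names below are made up for illustration and do not come from any particular assembler. The sketch builds the edge set of a de Bruijn graph once from a single long read and once from short reads that tile the same sequence, and checks that the two graphs are identical.

```python
def kmers(read, k):
    """Return every k-mer of `read` in order; empty if the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph as a set of edges between (k-1)-mers.

    Each k-mer contributes one edge from its (k-1)-base prefix
    to its (k-1)-base suffix. Read boundaries are not recorded.
    """
    edges = set()
    for read in reads:
        for kmer in kmers(read, k):
            edges.add((kmer[:-1], kmer[1:]))
    return edges


# Hypothetical toy data: one "long" read and short reads tiling it base by base.
long_read = "ACGTTGCATGGAC"
short_reads = [long_read[i:i + 6] for i in range(len(long_read) - 6 + 1)]

k = 4
same_graph = de_bruijn_edges([long_read], k) == de_bruijn_edges(short_reads, k)
print(same_graph)  # True: the graph cannot tell which reads the k-mers came from
```

Because the graph stores only k-mers, nothing in it records that the 13 bases of the long read were observed together in one molecule. That is the long-range information that is lost.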

Researchers can use two methods to retain long-range information. The first is to increase the k-mer size. However, once the k-mer size exceeds the maximum read length of a short-read library, those short reads cannot contribute to the de Bruijn graph at all. In fact, a de Bruijn graph built from short reads starts to fragment long before the k-mer size reaches the read length, because two reads are joined in the graph only if they overlap by at least k-1 bases. The second approach is to construct and resolve the de Bruijn graph with a moderate k-mer size that handles short reads properly, and then use the long-read information to connect the assembled fragments further.
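The effect of increasing k on short reads can be sketched with a toy example. The two reads below are hypothetical: they are 8 bases long and overlap by 5 bases. The helper `count_components` is an illustrative name, not part of any assembler. It counts the connected components of the de Bruijn graph using a simple union-find.

```python
def count_components(reads, k):
    """Count connected components of the de Bruijn graph built from `reads`.

    Nodes are (k-1)-mers; each k-mer links its prefix node to its suffix node.
    Reads shorter than k contribute nothing.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            a, b = find(kmer[:-1]), find(kmer[1:])
            if a != b:
                parent[a] = b
    return len({find(node) for node in list(parent)})


# Hypothetical toy data: two 8-base reads sharing the 5-base overlap "GTCAT".
genome = "ACGGTCATTGC"
reads = [genome[0:8], genome[3:11]]

components_by_k = {k: count_components(reads, k) for k in range(4, 10)}
for k, n in components_by_k.items():
    print(k, n)
```

In this toy case the graph stays in one piece up to k = 6, splits into two components at k = 7 (the reads no longer share a (k-1)-mer), and is empty at k = 9, where k exceeds the read length. Real data behaves less cleanly, but the trend is the same: the larger the k-mer, the longer the overlap that adjacent reads need.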

Long reads from most sequencing platforms can be included directly in the de Bruijn graph construction described in the second approach above, with one exception: PacBio. Reads from Pacific Biosciences instruments are long, but they also carry an elevated level of noise, distributed uniformly along the read. Constructing a de Bruijn graph from noisy data damages the graph topology and makes assembly very hard. There are two ways to use PacBio reads: (i) resolve the graph using only the short reads and then join contigs using the PacBio reads, or (ii) correct the PacBio reads using Illumina data and then use them in the assembly.
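A rough calculation shows why uniformly distributed noise is so harmful to k-mer based graphs. The sketch below assumes that errors occur independently at every base with the same probability, which is a simplification, and the error rates and k-mer size are illustrative values rather than measured figures. Under that assumption, the chance that a k-mer contains no error is (1 - e)^k.

```python
def error_free_kmer_fraction(error_rate, k):
    """Expected fraction of k-mers with no errors.

    Assumes independent errors with the same per-base probability
    `error_rate`; substitutions and indels are not distinguished.
    """
    return (1.0 - error_rate) ** k


k = 31  # illustrative k-mer size
low_noise = error_free_kmer_fraction(0.01, k)   # assumed 1% per-base error
high_noise = error_free_kmer_fraction(0.15, k)  # assumed 15% per-base error

print(f"1% error:  {low_noise:.3f} of {k}-mers are error-free")
print(f"15% error: {high_noise:.4f} of {k}-mers are error-free")
```

With these assumed numbers, roughly three quarters of the k-mers from the low-noise reads are correct, while fewer than one in a hundred from the high-noise reads are. Almost every k-mer from the noisy reads would then add a spurious node or edge, which is one way to see why such reads are either kept out of graph construction or corrected first.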