


Memory Requirement and k-mer Distribution of Perfect Library

Bioinformaticians trying to assemble genomes or transcriptomes from large NGS libraries usually grapple with two problems: (i) how to set k-mer parameters to get the best assembly, and (ii) how to complete the assembly within the RAM limits of the computer.

If all reads are perfect, every k-mer they contain already exists in the de Bruijn graph of the genome, so no read adds a new node or link. Therefore, whether we sequence the genome at 10X depth or 1000X depth, the size of the de Bruijn graph is bounded by the size of the underlying genome, not by the volume of data. This conclusion holds whether we know the genome beforehand or are assembling it de novo from short-read data.
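The saturation argument above can be checked on a toy example. The following is a minimal sketch, not an assembler: it assumes a random 2,000-base synthetic genome, error-free 100-base reads drawn uniformly from one strand, k = 21, and it treats each distinct k-mer as one graph node. The function names (`kmers`, `sample_reads`, `distinct_kmers_in_reads`) are made up for this illustration, and reverse complements are ignored.

```python
import random

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def sample_reads(genome, read_len, depth, rng):
    """Draw error-free reads at uniform random positions up to the given depth."""
    n_reads = depth * len(genome) // read_len
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        yield genome[start:start + read_len]

def distinct_kmers_in_reads(genome, k, read_len, depth, seed=0):
    """Count the distinct k-mers seen across all sampled reads."""
    rng = random.Random(seed)
    seen = set()
    for read in sample_reads(genome, read_len, depth, rng):
        seen.update(kmers(read, k))
    return len(seen)

# Hypothetical toy genome; a real genome would be read from a FASTA file.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
k = 21
genome_kmers = len(set(kmers(genome, k)))

for depth in (10, 100, 1000):
    n = distinct_kmers_in_reads(genome, k, read_len=100, depth=depth)
    print(f"{depth:>5}x  distinct k-mers in reads: {n}  (genome has {genome_kmers})")
```

Because every perfect read is a substring of the genome, the count from the reads can never exceed the count from the genome itself, no matter how large the depth becomes. Only the number of times each k-mer is seen grows with the data volume.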

The above observation has many implications for designing scalable and hardware-optimized algorithms for assembling large libraries. Researchers often ask how much RAM they would need to assemble a genome from HiSeq data. With an assembly algorithm that relies on de Bruijn graphs, the answer depends on the size of the genome being assembled, not on the volume of sequence data. In a world of perfect reads, assembling a yeast genome from three lanes of HiSeq data would need far less RAM than assembling a vertebrate genome from a similar amount of data.

However, we do not live in a perfect world, and all sequencing libraries contain some degree of error. In the next section, we will discuss how sequencing errors change the structure of the de Bruijn graph and greatly increase the memory requirement. This also means that some kind of pre-filtering to remove erroneous data can greatly reduce the memory requirement.
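As a rough preview of that effect, the sketch below adds random substitution errors to simulated reads and counts distinct k-mers before and after a simple filter. Everything here is an assumption made for illustration: the 1% uniform substitution rate, the 2,000-base random genome, k = 21, and especially the filter, which simply drops k-mers seen fewer than twice. It is one possible form of pre-filtering, not necessarily the one discussed later, and the helper names (`noisy_read`, `count_kmers`) are hypothetical.

```python
import random
from collections import Counter

def noisy_read(genome, start, read_len, error_rate, rng):
    """Copy a read from the genome, substituting each base with probability error_rate."""
    bases = []
    for base in genome[start:start + read_len]:
        if rng.random() < error_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    return "".join(bases)

def count_kmers(genome, k, read_len, depth, error_rate, seed=0):
    """Return a Counter mapping each k-mer seen in the simulated reads to its count."""
    rng = random.Random(seed)
    counts = Counter()
    n_reads = depth * len(genome) // read_len
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = noisy_read(genome, start, read_len, error_rate, rng)
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts

# Hypothetical toy genome, as in any small simulation; not real data.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(2000))
k = 21
genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}

for depth in (10, 50, 100):
    counts = count_kmers(genome, k, read_len=100, depth=depth, error_rate=0.01)
    kept = [kmer for kmer, c in counts.items() if c >= 2]  # assumed count threshold
    print(f"{depth:>4}x  all={len(counts)}  kept={len(kept)}  genome={len(genome_kmers)}")
```

With errors present, the number of distinct k-mers keeps growing with depth, because each substitution can create up to k new k-mers that are not in the genome. A count threshold removes many of these, although a threshold this low also discards genuine k-mers at shallow depth and lets some repeated errors through at high depth.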