Tutorials

Enjoy This Site? Join Our Remote R/Bioinformatics Classes

Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

K-mer Counting Algorithms

For new readers, easiest way to follow us is through our twitter feed. The feed is updated, whenever we post a commentary here.

Usually, when we receive a new genomic or transcriptomic read library, we generally like to check k-mer distribution before performing assemblies or other serious analysis. A number of efficient k-mer counting algorithms have been publicly available, and all of them give out easily compilable source codes.

A. Bloom Filter-based Approach

This method uses the fact that, in real data, large number of k-mers are singletons appearing due to sequencing errors. Bloom filter based approach takes the least amount of memory, but is slightly slower than JELLYFISH hashing approach.

i) Efficient counting of k -mers in DNA sequences using a bloom filter, Páll Melsted and Jonathan K Pritchard, BMC Bioinformatics 2011, 12:333 doi:10.1186/1471-2105-12-333.

BFcounter code is here.

ii) khmer, code

B. Hashing-based Approach as in JELLYFISH

i) Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics (2011) 27(6): 764-770 doi:10.1093/bioinformatics/btr011.

It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.

Their manual is available here.

C. Meryl

We are not sure of how efficient the algorithm is. Their website says -

An out-of-core k-mer counter. The amount of sequence that can be processed for any size k depends only on the amount of free disk space.

More here.

D. Tallymer – Suffix array based approach

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes Stefan Kurtz, Apurva Narechania, Joshua C Stein and Doreen Ware, BMC Genomics, 2008, 9:517 doi:10.1186/1471-2164-9-517.

DSK: K-mer Counting with Very Low Memory Usage Share…..00000 Minia genome assembly algorithm, which accomplished the unusual feat of assembling human genome in a desktop computer, used a novel k-mer counting method to keep memory usage below 4G limit. It was done, because k-mer counting takes much more RAM than the actual assembly from a reduced set of k-mers.

Their k-mer counting method is now published as a separate paper in Bioinformatics. We wonder, whether the name DSK was chosen to show respect to the French leader.

Most important table from the paper is shown below. DSK with SSD appears to perform as well as Jellyfish, but using only 1/18th as much RAM and 4 threads instead of 8 in Jellyfish. However, DSK allows you to do k-mer counting for arbitrary length, whereas Jellyfish is restricted to 32 mers.

Here is the abstract:

Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers. AVAILABILITY: http://minia.genouest.org/dsk CONTACT: rayan.chikhi@ens-cachan.org.