Minimizer - An Introductory Tutorial

This is a condensed version of our longer tutorial on minimizer algorithms available here.

Many bioinformatics algorithms use short substrings of a longer sequence, commonly known as k-mers, for indexing, search or assembly. Minimizers allow efficient binning of those k-mers so that some information about the sequence contiguity is preserved.

Bioinformatics programs using minimizers -

Name Place of use
kmc2 K-mer counting
bcalm,bcalm2 Compacting dBG
kraken Sequence classification
minimap Mapping long reads

The use of minimizers in bioinformatics was proposed by Roberts et al in 2004. However, the method did not see any use for almost 10 years until the next-generation sequencing data overwhelmed the computing capacities using existing popular programs. In 2012, MSPkmer algorithm applied minimizers to greatly increase the efficiency of disk-based k-mer counting. Since then, minimizers have been utilized to align long reads, compact de Bruijn graphs, search through microbial genomes and in other places. In every case, minimizer-based algorithms performed substantially better than the existing algorithms.

The strength of minimizer comes from its ability to use sequence contiguity, while binning the k-mers. A naive alternative would be to distribute the k-mers based on their first six or seven nucleotides. In that case, every k-mer from a 100 nucleotide read is likely to go to a different bin. In contrast, minimizer-based binning fragments the read into ~3-4 pieces. In this tutorial, I worked out an example to explain the concept.

Definition of a Minimizer

Consider a 100-nucleotide read.

ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA

If we compile all 7-mer subsequences and their reverse complements from the entire read, we will get 94 pairs of words. A subset is shown below.

Position 7-mer (forward strand) 7-mer (reverse strand)
1 ATGCGAT ATCGCAT
2 TGCGATA TATCGCA
3 GCGATAT ATATCGC
4 CGATATC GATATCG
5 GATATCG CGATATC
6 ATATCGT ACGATAT
7 TATCGTA TACGATA
8 ATCGTAG CTACGAT
9 TCGTAGG CCTACGA
10 CGTAGGC GCCTACG
11 GTAGGCG CGCCTAC
12 TAGGCGT ACGCCTA
……    
94 GCTAGCA TGCTAGC

When those 188 words are arranged in an alphabetical order, AAAGCGC at position 72 of the forward strand comes first. Therefore, AAAGCGC is the 7-mer minimizer for the entire read. Similarly, we can identify minimizer for any other size by compiling subsequences of that size and sorting them alphabetically.

We can generalize the above concept by finding the minimizers for the subsequences of the original read. To demonstrate, let us find the minimizers for all 31-nucleotide subsequence. The first 39 subsequences are shown below.

Position 31-mer 7-mer minimizer
1 ATGCGATATCGTAGGCGTCGATGGAGAGCTA AGAGCTA
2 TGCGATATCGTAGGCGTCGATGGAGAGCTAG AGAGCTA
3 GCGATATCGTAGGCGTCGATGGAGAGCTAGA AGAGCTA
4 CGATATCGTAGGCGTCGATGGAGAGCTAGAT AGAGCTA
5 GATATCGTAGGCGTCGATGGAGAGCTAGATC AGAGCTA
6 ATATCGTAGGCGTCGATGGAGAGCTAGATCG AGAGCTA
7 TATCGTAGGCGTCGATGGAGAGCTAGATCGA AGAGCTA
8 ATCGTAGGCGTCGATGGAGAGCTAGATCGAT AGAGCTA
9 TCGTAGGCGTCGATGGAGAGCTAGATCGATC AGAGCTA
10 CGTAGGCGTCGATGGAGAGCTAGATCGATCG AGAGCTA
11 GTAGGCGTCGATGGAGAGCTAGATCGATCGA AGAGCTA
12 TAGGCGTCGATGGAGAGCTAGATCGATCGAT AGAGCTA
13 AGGCGTCGATGGAGAGCTAGATCGATCGATC AGAGCTA
14 GGCGTCGATGGAGAGCTAGATCGATCGATCT AGAGCTA
15 GCGTCGATGGAGAGCTAGATCGATCGATCTA AGAGCTA
16 CGTCGATGGAGAGCTAGATCGATCGATCTAA AGAGCTA
17 GTCGATGGAGAGCTAGATCGATCGATCTAAA AGAGCTA
18 TCGATGGAGAGCTAGATCGATCGATCTAAAT AGAGCTA
19 CGATGGAGAGCTAGATCGATCGATCTAAATC AGAGCTA
20 GATGGAGAGCTAGATCGATCGATCTAAATCC AGAGCTA
21 ATGGAGAGCTAGATCGATCGATCTAAATCCC AAATCCC
22 TGGAGAGCTAGATCGATCGATCTAAATCCCG AAATCCC
23 GGAGAGCTAGATCGATCGATCTAAATCCCGA AAATCCC
24 GAGAGCTAGATCGATCGATCTAAATCCCGAT AAATCCC
25 AGAGCTAGATCGATCGATCTAAATCCCGATC AAATCCC
26 GAGCTAGATCGATCGATCTAAATCCCGATCG AAATCCC
27 AGCTAGATCGATCGATCTAAATCCCGATCGA AAATCCC
28 GCTAGATCGATCGATCTAAATCCCGATCGAT AAATCCC
29 CTAGATCGATCGATCTAAATCCCGATCGATT AAATCCC
30 TAGATCGATCGATCTAAATCCCGATCGATTC AAATCCC
31 AGATCGATCGATCTAAATCCCGATCGATTCC AAATCCC
32 GATCGATCGATCTAAATCCCGATCGATTCCG AAATCCC
33 ATCGATCGATCTAAATCCCGATCGATTCCGA AAATCCC
34 TCGATCGATCTAAATCCCGATCGATTCCGAG AAATCCC
35 CGATCGATCTAAATCCCGATCGATTCCGAGC AAATCCC
36 GATCGATCTAAATCCCGATCGATTCCGAGCG AAATCCC
37 ATCGATCTAAATCCCGATCGATTCCGAGCGC AAATCCC
38 TCGATCTAAATCCCGATCGATTCCGAGCGCG AAATCCC
39 CGATCTAAATCCCGATCGATTCCGAGCGCGA AAATCCC

Use of Minimizer

You will notice that the minimizer stays the same for several consecutive 31-mers in the read. This is the key property that is useful in developing efficient bioinformatics algorithms using minimizers.

To explain the advantage of using minimizers, we start with the simple problem of counting k-mers within a set of reads. In this problem, one starts with a collection of reads and finds the counts of all subsequences of certain size (k-mers) within it.

For a small number of reads, the entire counting can be done in memory by using a hash, but for a large set of reads, the requirement to hold both k-mers and counts will require large amount of RAM, eventually exceeding the limits of the given computer. Therefore, the counting needs to be done with the assistance of disk-based storage. In this approach, each read is split into k-mers and those k-mers are sent to different bins based on the first few nucleotides in them. After all reads are partitioned, each bin is loaded separately and its k-mers are counted by using the same hash method as described above.

If the k-mers are separated for binning based on the first five nuclotides, 4x4x4x4x4=1024 bins will be needed. Also, if we bin the 39 k-mers in the table above based on the first five nucleotides, they will go to 34 different bins.

Minimizer-based Binning

A minimizer-based binning algorithm instead uses the minimizer of a k-mer to bin it. The algorithm works, because each k-mer has a unique minimizer of a certain size and will therefore always go to the same bin. The rest of the algorithm stays the same. Minimizer-mer based binning provides huge advantage to the binning process. The 39 k-mers will go to only two bins instead of 34 bins. Moreover, with minimizer-based binning, it is not necessary to split the read into all its k-mer components. Instead, we can send the whole subsequence with the same minimizer to the corresponding bin. The splitting of the subsequence into k-mers can be done later during the counting step, thus greatly reducing the storage requirement for the bins.

The original 100 nucleotide read will be split into only five segments instead of 94 kmers, as shown below.

Segment Minimizer-based bin
ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCC AGAGCTA
ATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAG AAATCCC
AATCCCGATCGATTCCGAGCGCGATCAAAGC AATCCCG
ATCCCGATCGATTCCGAGCGCGATCAAAGCG AATCGAT
TCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA AAAGCGC

Minimizer Code in Python

The Python code to demonstrate the minimizer concept of the previous section are shown below. We are presenting the simplest code that exactly replicates the procedure presented in the previous section. You will notice that many redundant calculations are done to find minimizers for every k-mer on a sliding windown. Therefore, it is possible to make the code efficient with minor effort.

seq="ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA"

rev=seq[::-1]

rev=rev.replace("A","X")
rev=rev.replace("T","A")
rev=rev.replace("X","T")
rev=rev.replace("C","X")
rev=rev.replace("G","C")
rev=rev.replace("X","G")

Kmer=31
M=7
L=len(seq)

for i in range(0, L-Kmer+1):

        sub_f=seq[i:i+Kmer]
        sub_r=rev[L-Kmer-i:L-i]

        min="ZZZZZZZZZZZZZ"
        for j in range(0, Kmer-M+1):
                sub2=sub_f[j:j+M]
                if sub2 < min:
                        min=sub2
                sub2=sub_r[j:j+M]
                if sub2 < min:
                        min=sub2

        print sub_f,min

Written by M. //