Minimizer - An Introductory Tutorial

This is a condensed version of our longer tutorial on minimizer algorithms available here.

Many bioinformatics algorithms use short substrings of a longer sequence, commonly known as k-mers, for indexing, search or assembly. Minimizers allow efficient binning of those k-mers so that some information about the sequence contiguity is preserved.

Bioinformatics programs using minimizers -

Name	Place of use
kmc2	K-mer counting
bcalm,bcalm2	Compacting dBG
kraken	Sequence classification
minimap	Mapping long reads

The use of minimizers in bioinformatics was proposed by Roberts et al in 2004. However, the method did not see any use for almost 10 years until the next-generation sequencing data overwhelmed the computing capacities using existing popular programs. In 2012, MSPkmer algorithm applied minimizers to greatly increase the efficiency of disk-based k-mer counting. Since then, minimizers have been utilized to align long reads, compact de Bruijn graphs, search through microbial genomes and in other places. In every case, minimizer-based algorithms performed substantially better than the existing algorithms.

The strength of minimizer comes from its ability to use sequence contiguity, while binning the k-mers. A naive alternative would be to distribute the k-mers based on their first six or seven nucleotides. In that case, every k-mer from a 100 nucleotide read is likely to go to a different bin. In contrast, minimizer-based binning fragments the read into ~3-4 pieces. In this tutorial, I worked out an example to explain the concept.

Definition of a Minimizer

Consider a 100-nucleotide read.

ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA

If we compile all 7-mer subsequences and their reverse complements from the entire read, we will get 94 pairs of words. A subset is shown below.

Position	7-mer (forward strand)	7-mer (reverse strand)
1	ATGCGAT	ATCGCAT
2	TGCGATA	TATCGCA
3	GCGATAT	ATATCGC
4	CGATATC	GATATCG
5	GATATCG	CGATATC
6	ATATCGT	ACGATAT
7	TATCGTA	TACGATA
8	ATCGTAG	CTACGAT
9	TCGTAGG	CCTACGA
10	CGTAGGC	GCCTACG
11	GTAGGCG	CGCCTAC
12	TAGGCGT	ACGCCTA
……
94	GCTAGCA	TGCTAGC

When those 188 words are arranged in an alphabetical order, AAAGCGC at position 72 of the forward strand comes first. Therefore, AAAGCGC is the 7-mer minimizer for the entire read. Similarly, we can identify minimizer for any other size by compiling subsequences of that size and sorting them alphabetically.

We can generalize the above concept by finding the minimizers for the subsequences of the original read. To demonstrate, let us find the minimizers for all 31-nucleotide subsequence. The first 39 subsequences are shown below.

Position	31-mer	7-mer minimizer
1	ATGCGATATCGTAGGCGTCGATGGAGAGCTA	AGAGCTA
2	TGCGATATCGTAGGCGTCGATGGAGAGCTAG	AGAGCTA
3	GCGATATCGTAGGCGTCGATGGAGAGCTAGA	AGAGCTA
4	CGATATCGTAGGCGTCGATGGAGAGCTAGAT	AGAGCTA
5	GATATCGTAGGCGTCGATGGAGAGCTAGATC	AGAGCTA
6	ATATCGTAGGCGTCGATGGAGAGCTAGATCG	AGAGCTA
7	TATCGTAGGCGTCGATGGAGAGCTAGATCGA	AGAGCTA
8	ATCGTAGGCGTCGATGGAGAGCTAGATCGAT	AGAGCTA
9	TCGTAGGCGTCGATGGAGAGCTAGATCGATC	AGAGCTA
10	CGTAGGCGTCGATGGAGAGCTAGATCGATCG	AGAGCTA
11	GTAGGCGTCGATGGAGAGCTAGATCGATCGA	AGAGCTA
12	TAGGCGTCGATGGAGAGCTAGATCGATCGAT	AGAGCTA
13	AGGCGTCGATGGAGAGCTAGATCGATCGATC	AGAGCTA
14	GGCGTCGATGGAGAGCTAGATCGATCGATCT	AGAGCTA
15	GCGTCGATGGAGAGCTAGATCGATCGATCTA	AGAGCTA
16	CGTCGATGGAGAGCTAGATCGATCGATCTAA	AGAGCTA
17	GTCGATGGAGAGCTAGATCGATCGATCTAAA	AGAGCTA
18	TCGATGGAGAGCTAGATCGATCGATCTAAAT	AGAGCTA
19	CGATGGAGAGCTAGATCGATCGATCTAAATC	AGAGCTA
20	GATGGAGAGCTAGATCGATCGATCTAAATCC	AGAGCTA
21	ATGGAGAGCTAGATCGATCGATCTAAATCCC	AAATCCC
22	TGGAGAGCTAGATCGATCGATCTAAATCCCG	AAATCCC
23	GGAGAGCTAGATCGATCGATCTAAATCCCGA	AAATCCC
24	GAGAGCTAGATCGATCGATCTAAATCCCGAT	AAATCCC
25	AGAGCTAGATCGATCGATCTAAATCCCGATC	AAATCCC
26	GAGCTAGATCGATCGATCTAAATCCCGATCG	AAATCCC
27	AGCTAGATCGATCGATCTAAATCCCGATCGA	AAATCCC
28	GCTAGATCGATCGATCTAAATCCCGATCGAT	AAATCCC
29	CTAGATCGATCGATCTAAATCCCGATCGATT	AAATCCC
30	TAGATCGATCGATCTAAATCCCGATCGATTC	AAATCCC
31	AGATCGATCGATCTAAATCCCGATCGATTCC	AAATCCC
32	GATCGATCGATCTAAATCCCGATCGATTCCG	AAATCCC
33	ATCGATCGATCTAAATCCCGATCGATTCCGA	AAATCCC
34	TCGATCGATCTAAATCCCGATCGATTCCGAG	AAATCCC
35	CGATCGATCTAAATCCCGATCGATTCCGAGC	AAATCCC
36	GATCGATCTAAATCCCGATCGATTCCGAGCG	AAATCCC
37	ATCGATCTAAATCCCGATCGATTCCGAGCGC	AAATCCC
38	TCGATCTAAATCCCGATCGATTCCGAGCGCG	AAATCCC
39	CGATCTAAATCCCGATCGATTCCGAGCGCGA	AAATCCC

Use of Minimizer

You will notice that the minimizer stays the same for several consecutive 31-mers in the read. This is the key property that is useful in developing efficient bioinformatics algorithms using minimizers.

To explain the advantage of using minimizers, we start with the simple problem of counting k-mers within a set of reads. In this problem, one starts with a collection of reads and finds the counts of all subsequences of certain size (k-mers) within it.

For a small number of reads, the entire counting can be done in memory by using a hash, but for a large set of reads, the requirement to hold both k-mers and counts will require large amount of RAM, eventually exceeding the limits of the given computer. Therefore, the counting needs to be done with the assistance of disk-based storage. In this approach, each read is split into k-mers and those k-mers are sent to different bins based on the first few nucleotides in them. After all reads are partitioned, each bin is loaded separately and its k-mers are counted by using the same hash method as described above.

If the k-mers are separated for binning based on the first five nuclotides, 4x4x4x4x4=1024 bins will be needed. Also, if we bin the 39 k-mers in the table above based on the first five nucleotides, they will go to 34 different bins.

Minimizer-based Binning

A minimizer-based binning algorithm instead uses the minimizer of a k-mer to bin it. The algorithm works, because each k-mer has a unique minimizer of a certain size and will therefore always go to the same bin. The rest of the algorithm stays the same. Minimizer-mer based binning provides huge advantage to the binning process. The 39 k-mers will go to only two bins instead of 34 bins. Moreover, with minimizer-based binning, it is not necessary to split the read into all its k-mer components. Instead, we can send the whole subsequence with the same minimizer to the corresponding bin. The splitting of the subsequence into k-mers can be done later during the counting step, thus greatly reducing the storage requirement for the bins.

The original 100 nucleotide read will be split into only five segments instead of 94 kmers, as shown below.

Segment	Minimizer-based bin
ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCC	AGAGCTA
ATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAG	AAATCCC
AATCCCGATCGATTCCGAGCGCGATCAAAGC	AATCCCG
ATCCCGATCGATTCCGAGCGCGATCAAAGCG	AATCGAT
TCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA	AAAGCGC

Minimizer Code in Python

The Python code to demonstrate the minimizer concept of the previous section are shown below. We are presenting the simplest code that exactly replicates the procedure presented in the previous section. You will notice that many redundant calculations are done to find minimizers for every k-mer on a sliding windown. Therefore, it is possible to make the code efficient with minor effort.

seq="ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA"

rev=seq[::-1]

rev=rev.replace("A","X")
rev=rev.replace("T","A")
rev=rev.replace("X","T")
rev=rev.replace("C","X")
rev=rev.replace("G","C")
rev=rev.replace("X","G")

Kmer=31
M=7
L=len(seq)

for i in range(0, L-Kmer+1):

        sub_f=seq[i:i+Kmer]
        sub_r=rev[L-Kmer-i:L-i]

        min="ZZZZZZZZZZZZZ"
        for j in range(0, Kmer-M+1):
                sub2=sub_f[j:j+M]
                if sub2 < min:
                        min=sub2
                sub2=sub_r[j:j+M]
                if sub2 < min:
                        min=sub2

        print sub_f,min

‹»Compact Universal Set of Minimizers« »Is '23 and Me' Misleading its Jewish Customers on Ancestry?«›