Minimizer - An Introductory Tutorial
This is a condensed version of our longer tutorial on minimizer algorithms available here.
Many bioinformatics algorithms use short substrings of a longer sequence, commonly known as k-mers, for indexing, search or assembly. Minimizers allow efficient binning of those k-mers so that some information about the sequence contiguity is preserved.
Bioinformatics programs using minimizers -
Name | Place of use |
---|---|
kmc2 | K-mer counting |
bcalm,bcalm2 | Compacting dBG |
kraken | Sequence classification |
minimap | Mapping long reads |
The use of minimizers in bioinformatics was proposed by Roberts et al in 2004. However, the method did not see any use for almost 10 years until the next-generation sequencing data overwhelmed the computing capacities using existing popular programs. In 2012, MSPkmer algorithm applied minimizers to greatly increase the efficiency of disk-based k-mer counting. Since then, minimizers have been utilized to align long reads, compact de Bruijn graphs, search through microbial genomes and in other places. In every case, minimizer-based algorithms performed substantially better than the existing algorithms.
The strength of minimizer comes from its ability to use sequence contiguity, while binning the k-mers. A naive alternative would be to distribute the k-mers based on their first six or seven nucleotides. In that case, every k-mer from a 100 nucleotide read is likely to go to a different bin. In contrast, minimizer-based binning fragments the read into ~3-4 pieces. In this tutorial, I worked out an example to explain the concept.
Definition of a Minimizer
Consider a 100-nucleotide read.
ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA
If we compile all 7-mer subsequences and their reverse complements from the entire read, we will get 94 pairs of words. A subset is shown below.
Position | 7-mer (forward strand) | 7-mer (reverse strand) |
---|---|---|
1 | ATGCGAT | ATCGCAT |
2 | TGCGATA | TATCGCA |
3 | GCGATAT | ATATCGC |
4 | CGATATC | GATATCG |
5 | GATATCG | CGATATC |
6 | ATATCGT | ACGATAT |
7 | TATCGTA | TACGATA |
8 | ATCGTAG | CTACGAT |
9 | TCGTAGG | CCTACGA |
10 | CGTAGGC | GCCTACG |
11 | GTAGGCG | CGCCTAC |
12 | TAGGCGT | ACGCCTA |
…… | ||
94 | GCTAGCA | TGCTAGC |
When those 188 words are arranged in an alphabetical order, AAAGCGC at position 72 of the forward strand comes first. Therefore, AAAGCGC is the 7-mer minimizer for the entire read. Similarly, we can identify minimizer for any other size by compiling subsequences of that size and sorting them alphabetically.
We can generalize the above concept by finding the minimizers for the subsequences of the original read. To demonstrate, let us find the minimizers for all 31-nucleotide subsequence. The first 39 subsequences are shown below.
Position | 31-mer | 7-mer minimizer |
---|---|---|
1 | ATGCGATATCGTAGGCGTCGATGGAGAGCTA | AGAGCTA |
2 | TGCGATATCGTAGGCGTCGATGGAGAGCTAG | AGAGCTA |
3 | GCGATATCGTAGGCGTCGATGGAGAGCTAGA | AGAGCTA |
4 | CGATATCGTAGGCGTCGATGGAGAGCTAGAT | AGAGCTA |
5 | GATATCGTAGGCGTCGATGGAGAGCTAGATC | AGAGCTA |
6 | ATATCGTAGGCGTCGATGGAGAGCTAGATCG | AGAGCTA |
7 | TATCGTAGGCGTCGATGGAGAGCTAGATCGA | AGAGCTA |
8 | ATCGTAGGCGTCGATGGAGAGCTAGATCGAT | AGAGCTA |
9 | TCGTAGGCGTCGATGGAGAGCTAGATCGATC | AGAGCTA |
10 | CGTAGGCGTCGATGGAGAGCTAGATCGATCG | AGAGCTA |
11 | GTAGGCGTCGATGGAGAGCTAGATCGATCGA | AGAGCTA |
12 | TAGGCGTCGATGGAGAGCTAGATCGATCGAT | AGAGCTA |
13 | AGGCGTCGATGGAGAGCTAGATCGATCGATC | AGAGCTA |
14 | GGCGTCGATGGAGAGCTAGATCGATCGATCT | AGAGCTA |
15 | GCGTCGATGGAGAGCTAGATCGATCGATCTA | AGAGCTA |
16 | CGTCGATGGAGAGCTAGATCGATCGATCTAA | AGAGCTA |
17 | GTCGATGGAGAGCTAGATCGATCGATCTAAA | AGAGCTA |
18 | TCGATGGAGAGCTAGATCGATCGATCTAAAT | AGAGCTA |
19 | CGATGGAGAGCTAGATCGATCGATCTAAATC | AGAGCTA |
20 | GATGGAGAGCTAGATCGATCGATCTAAATCC | AGAGCTA |
21 | ATGGAGAGCTAGATCGATCGATCTAAATCCC | AAATCCC |
22 | TGGAGAGCTAGATCGATCGATCTAAATCCCG | AAATCCC |
23 | GGAGAGCTAGATCGATCGATCTAAATCCCGA | AAATCCC |
24 | GAGAGCTAGATCGATCGATCTAAATCCCGAT | AAATCCC |
25 | AGAGCTAGATCGATCGATCTAAATCCCGATC | AAATCCC |
26 | GAGCTAGATCGATCGATCTAAATCCCGATCG | AAATCCC |
27 | AGCTAGATCGATCGATCTAAATCCCGATCGA | AAATCCC |
28 | GCTAGATCGATCGATCTAAATCCCGATCGAT | AAATCCC |
29 | CTAGATCGATCGATCTAAATCCCGATCGATT | AAATCCC |
30 | TAGATCGATCGATCTAAATCCCGATCGATTC | AAATCCC |
31 | AGATCGATCGATCTAAATCCCGATCGATTCC | AAATCCC |
32 | GATCGATCGATCTAAATCCCGATCGATTCCG | AAATCCC |
33 | ATCGATCGATCTAAATCCCGATCGATTCCGA | AAATCCC |
34 | TCGATCGATCTAAATCCCGATCGATTCCGAG | AAATCCC |
35 | CGATCGATCTAAATCCCGATCGATTCCGAGC | AAATCCC |
36 | GATCGATCTAAATCCCGATCGATTCCGAGCG | AAATCCC |
37 | ATCGATCTAAATCCCGATCGATTCCGAGCGC | AAATCCC |
38 | TCGATCTAAATCCCGATCGATTCCGAGCGCG | AAATCCC |
39 | CGATCTAAATCCCGATCGATTCCGAGCGCGA | AAATCCC |
Use of Minimizer
You will notice that the minimizer stays the same for several consecutive 31-mers in the read. This is the key property that is useful in developing efficient bioinformatics algorithms using minimizers.
To explain the advantage of using minimizers, we start with the simple problem of counting k-mers within a set of reads. In this problem, one starts with a collection of reads and finds the counts of all subsequences of certain size (k-mers) within it.
For a small number of reads, the entire counting can be done in memory by using a hash, but for a large set of reads, the requirement to hold both k-mers and counts will require large amount of RAM, eventually exceeding the limits of the given computer. Therefore, the counting needs to be done with the assistance of disk-based storage. In this approach, each read is split into k-mers and those k-mers are sent to different bins based on the first few nucleotides in them. After all reads are partitioned, each bin is loaded separately and its k-mers are counted by using the same hash method as described above.
If the k-mers are separated for binning based on the first five nuclotides, 4x4x4x4x4=1024 bins will be needed. Also, if we bin the 39 k-mers in the table above based on the first five nucleotides, they will go to 34 different bins.
Minimizer-based Binning
A minimizer-based binning algorithm instead uses the minimizer of a k-mer to bin it. The algorithm works, because each k-mer has a unique minimizer of a certain size and will therefore always go to the same bin. The rest of the algorithm stays the same. Minimizer-mer based binning provides huge advantage to the binning process. The 39 k-mers will go to only two bins instead of 34 bins. Moreover, with minimizer-based binning, it is not necessary to split the read into all its k-mer components. Instead, we can send the whole subsequence with the same minimizer to the corresponding bin. The splitting of the subsequence into k-mers can be done later during the counting step, thus greatly reducing the storage requirement for the bins.
The original 100 nucleotide read will be split into only five segments instead of 94 kmers, as shown below.
Segment | Minimizer-based bin |
---|---|
ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCC | AGAGCTA |
ATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAG | AAATCCC |
AATCCCGATCGATTCCGAGCGCGATCAAAGC | AATCCCG |
ATCCCGATCGATTCCGAGCGCGATCAAAGCG | AATCGAT |
TCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA | AAAGCGC |
Minimizer Code in Python
The Python code to demonstrate the minimizer concept of the previous section are shown below. We are presenting the simplest code that exactly replicates the procedure presented in the previous section. You will notice that many redundant calculations are done to find minimizers for every k-mer on a sliding windown. Therefore, it is possible to make the code efficient with minor effort.
seq="ATGCGATATCGTAGGCGTCGATGGAGAGCTAGATCGATCGATCTAAATCCCGATCGATTCCGAGCGCGATCAAAGCGCGATAGGCTAGCTAAAGCTAGCA"
rev=seq[::-1]
rev=rev.replace("A","X")
rev=rev.replace("T","A")
rev=rev.replace("X","T")
rev=rev.replace("C","X")
rev=rev.replace("G","C")
rev=rev.replace("X","G")
Kmer=31
M=7
L=len(seq)
for i in range(0, L-Kmer+1):
sub_f=seq[i:i+Kmer]
sub_r=rev[L-Kmer-i:L-i]
min="ZZZZZZZZZZZZZ"
for j in range(0, Kmer-M+1):
sub2=sub_f[j:j+M]
if sub2 < min:
min=sub2
sub2=sub_r[j:j+M]
if sub2 < min:
min=sub2
print sub_f,min