# Compact Universal Set of Minimizers

There has been a number of interesting recent developments on minimizers likely to make bioinformatics algorithms even more efficient. In this post, we like to mention three papers by Y. Orenstein, G. Marçais and collaborators.

For those unfamiliar with minimizers, it is an algorithmic concept proposed in 2004 by Robert et al. to bin k-mers from a longer sequence efficiently. If k-mers are binned based on first six or seven nucleotides, every k-mer from a 100 nucleotide read is likely to go to a different bin. In contrast, minimizer-based binning maintains information about the read contiguity. I posted an introductory tutorial here (in the Bioinformatics lgorithms section), where I worked out a case to show how everything fits together.

Despite their efficiency, minimizers got ignored in bioinformatics for almost 10 years. Then in 2012, the MSPkmer algorithm used it to greatly increase the efficiency of disk-based k-mer counting. Since then, minimizers have been used to perform long read alignment, compact de Bruin graph, search through microbial genomes and in other places. In every case, minimizer-based algorithms performed way better than the existing algorithms.

One challenge in using minimizers is how to order the minimizer sequences so that (i) the bin sizes stay nearly balanced, (ii) number of bins is minimal. Orenstein et al. developed a heuristic algorithm to derive near minimal set of minimizers to cover k-mers of different sizes. They first showed that the problem of finding minimal set of minimizer was NP-complete. Then, to derive an approximate set, they combined two steps. First, they determined a decycling set in a complete de Bruijn graph of order k using Mykkeltveit’s algorithm and removed it from the graph. Then, to cover the remaining sequences, they repeatedly removed a vertex v with the largest hitting number Tl(v) until there was no l-long paths. Tl(u) for all remaining u were recomputed after each removal.

## Abstracts

### Compact Universal k-mer Hitting Sets

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence lengths L and by applying them to two bacterial genomes show that universal hitting k-mers improve on minimizers. The software and exemplary sets are freely available at acgt.cs.tau.ac.il/docks/.

### Improving the performance of minimizers and winnowing schemes

MOTIVATION: The minimizers scheme is a method for selecting k -mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k -mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k -mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

RESULTS: We provide an in-depth analysis of the effect of k -mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al. ) on the expected density of minimizers in a random sequence.

AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git .

### Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.