A new paper by Prashant Pandey et al. got recently published in Cell Systems. Well, it is not new for those who follow Biorxiv. You can find the preprint version at this link. Mantis code is open source and is available from github.
A number of groups are working on the topic of finding expressed genes from a large collection of raw RNAseq data sets. We provided an introduction in an earlier blog post. These programs run as a pre-filter to identify the subset of interesting libraries, and then the researcher can run additional downstream analysis steps on the subset. This two-step process saves time from unnecessarily transferring and processing thousands of uninformative libraries.
The competing algorithms use Bloom filter, a probabilistic data structure. In operational terms, it is a bitvector or an array of 1s and 0s. Bloom filters are built for each RNAseq data set, and then arranged in tree form for faster query. Those interested in learning about Bloom filters can take a look at our tutorial.
Compared to other competing tools (e.g. Split-Sequence Bloom Tree or SSBT), Mantis index is 20% smaller and takes 1/6th time to build. Moreover, querying the index is 6-100x faster, and the result contains no false-positive or false-negative.
The secret sauce of Mantis is Counting Quotient filter, described nicely in this blog post by the first author. It is a Bloom filter alternative that offers several advantages. Quoting from the Mantis paper -
The CQF is a Bloom-filter alternative that offers several advantages over the Bloom filter. First, the CQF supports counting, i.e., queries to the CQF return not only “present” or “absent,” but also an estimate on the number of times the queried item has been inserted. Analogous to the Bloom filter’s false-positive rate, there is a tunable probability δ that the CQF may return a count that is higher than the true count for a queried element. CQFs can also be resized, and CQFs of different sizes can be merged together efficiently. Finally, CQFs can be used in an “exact” mode where they act as a compact exact hash table, i.e., we can make δ = 0. CQFs are also faster and more space efficient than Bloom filters.
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6–108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.