Using Multidimensional Bloom filters to Search RNAseq Libraries - (i)

A number of recent papers are proposing to use multidimensional Bloom filters to identify genes from a large collection of RNAseq libraries. This post provides general perspective on these papers. In a later post, we will go in depth and explain the algorithm of the recent preprint by carrying out an example.

Bloom filters in general found many uses in bioinformatics, and we covered them in our blog in the past. We now have a new tutorial on Bloom filter for those unfamiliar with this probabilistic data structure and its applications in bioinformatics. The concept of multidimensional Bloom filter goes a level higher than oridinary Bloom filters, because it is meant to store and query a large collection of Bloom filters. Each Bloom filter within the set may store a representation (usually kmer-based) of a genome or RNAseq library.

The ‘big data’ problem addressed by the new papers is as follows. Let us say someone has a large collection of RNAseq libraries and wants to determine whether a gene is expressed in some of them. Once the subset of libraries are identified, the researcher can run additional downstream analysis steps on the subset. This saves time from unnecessarily transferring and processing thousands of uninformative libraries.

If the kmers within each library are stored as a Bloom filter for efficiency, the task of identifying the subset can be solved in through -

split the given gene into kmers,
check each kmer with the Bloom filter for presence or absence,
declare the gene to be present, if a substantial number of kmers are present. All disclaimer about the probabilistic nature of Bloom filter apply.

For 1,000 independent RNAseq libraries, The above process would require cheking 1,000 separate Bloom filters. A multidimensional Bloom filter can make the step efficient by removing redundancies. We discuss various relevant papers in the following sections.

Bloofi

In 2013, Crainiceanu proposed an algorithm for efficiently searching through a collection of Bloom filters. In operational terms, a Bloom filter is a bitvector, i.e. an array of 1s and 0s. Bloofi, the hierarchical index structure proposed in the paper arranges those bitvectors in the form of a B+ tree.

Adina Crainiceanu, “Bloofi: A Hierarchical Bloom Filter Index with Applications to Distributed Data Provenance”, 2013 download.
Adina Crainiceanu and Daniel Lemire. “Bloofi: Multidimensional Bloom filters”, Information Systems, 54:311{324}, 2015 download at arxiv.

Sequence Bloom Tree (SBT)

In 2015, Solomon and Kingsford used Bloofi-like algorithm in bioinformatics to build an application for querying genes in a vast collection of RNAseq libraries. Their method had couple of improvement over Bloofi, including using RRR compression scheme for storing the Bloom filters.

Brad Solomon, Carleton Kingsford, “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”, 2015 download.

Split Sequence Bloom Tree (SSBT) and allsome-SBT

Two recent papers removed redundant tasks from SBT to improve their speed of construction and query, as well as the storage space. The primary idea is that many common kmers show up in all RNAseq data, and it does not make sense to search for them in every Bloom filter. If the signatures for those kmers are marked in the parent nodes of the binary tree, the search can be concluded at an early step. Secondly, the tree can be built more intelligently. In current form, the grouping is dependent on the first few.

In the following post, we will discuss those improvements in more detail by going over the construction of a tree with real data.

Brad Solomon, Carleton Kingsford, “Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees”, 2016 download.
Chen Sun, Robert S. Harris, Rayan Chikhi, Paul Medvedev, AllSome Sequence Bloom Trees, 2016 link.

‹»Postdoctoral Scholar Position in Comparative Plant Genomics and Bioinformatics« »GRASS for Rapid Reannotation of RNAseq Data«›