Fast construction of FM-index for long sequence reads

We mentioned Heng Li’s ropebwt2 code at github in the past, but now he discloses the algorithm in a brief paper posted at arxiv. We will spend the rest of the day trying to understand what is going on.

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length. Availability and implementation: https://github.com/lh3/ropebwt2 Contact: hengli@broadinstitute.org

Is constructing FM index still relevant in the ‘Minimizer’ world?

-——————

Edit.

Rayan Chikhi worked through an example here that you will find helpful.

Algorithm 1 - Illustration

-————————-

Algorithm 1 is InsertIO1(B,P) in the paper, which takes as input:

B (the BWT of a string T#, where # is a sentinel character)

P (any string)

It outputs BWT(T#P$), where $ is a character lexicographically larger than #,

but lexicographically smaller than all other letters.

An order of sentinels is introduced in the Methods section of the paper

($0 < $_1 < … < $(m-1)), but not in the presentation of Algorithm 1.

Yet, it is important.

We will later see that Algorithm 1 would be incorrect if we considered that # = $.

Let’s run Algorithm 1 on an example: InsertIO1(B = “ar#ab”, P = “da”)

Recall the segmentation of B, taking into account the new characters in da$, is:

B_# B_$ B_a B_b B_d B_r,

where

B_# = “a”

B_$ = “”

B_a = “r#”

B_b = “a”

B_d = “”

B_r = “b”.

The steps of the Algorithm are:

‹»An efficient algorithm for accurate computation of the Dirichlet-multinomial log-likelihood function« »How Google Manipulates Its Search Results«›