Digging into BWA-mem Code

write-blog

As you can tell, we are in very early stage of figuring out what is going on.

Algorithm-wise, Lam’s group integrated BWT and Smith Waterman. That was a major breakthrough, and it made Smith Waterman competitive with BLAST. (It was described in the first paper of previous commentary.)

-—————–

Second improvement came from their incorporation of bidirectional BWT. (Please check second paper of previous commentary).

BWA incorporates something called FMD-index described in the Fermi paper (third one in prev. commentary) as

In comparison to the bidirectional BWT, which uses two FM-indices, the FMD- index builds both forward and reverse strand DNA sequences in one index. Although the FMD-index is not applicable to generic texts, it is conceptually more consistent with double-strand DNA and improves the speed of exact matching as we only need to search against one index. For example, DWA-SW gets a 80% speedup when we adopt the FMD-index as the data structure.

-——————

Third conceptually important step is ‘supermaximal exact matches’ as described in Fermi paper.

An FMD-index can be used to find supermaximal exact matches (SMEMs) between a reference and a query sequence. Formally, a maximal exact match (MEM) is an exact match that cannot be extended in either direction of the match. An SMEM is a MEM that is not considered in other MEMs on the query sequence. Fermi uses SMEMs to map reads back to the unitig.

-——————-

Fourth important step is seeding and reseeding strategy of BMA-MEM, described in the fourth paper of the list.

-——————–

We believe it is important to fully understand the four conceptual steps mentioned above before BWA-MEM code starts to talk to you.

Remaining BWA-MEM files are here -

write-blog

-———–

1. klib - Attractive Chaos

Klib is a standalone and lightweight C library distributed under MIT/X11 license. Most components are independent of external libraries, except the standard C library, and independent of each other. To use a component of this library, you only need to copy a couple of files to your source code tree without worrying about library dependencies.

Klib strives for efficiency and a small memory footprint. Some components, such as khash.h, kbtree.h, ksort.h and kvec.h, are among the most efficient implementations of similar algorithms or data structures in all programming languages, in terms of both speed and memory use.

2. Qsufsort

FASTER SUFFIX SORTING - N. JESPER LARSSON AND KUNIHIKO SADAKANE

Abstract. We propose a fast and memory ecient algorithm for lexico-

graphically sorting the suxes of a string, a problem that has important

applications in data compression as well as string matching.

Our algorithm eliminates much of the overhead of previous specialized

approaches while maintaining their robustness for all kinds of input. For

input size n, our algorithm operates in only two integer arrays of size n,

and has worst case time complexity O(nlogn).

We demonstrate experimentally that our algorithm has favourable

performance compared to other approaches, and argue that our algo-

rithm is the prime choice for general sux sorting.

‹»The Potential of Preprints to Accelerate Scholarly Communication - A Bibliometric Analysis Based on Selected Journals« »Richard Durbin at International Maths Olympiad«›