External Memory Generalized Suffix and LCP Arrays Construction

Felipe A. Louza, one of our readers, posted link to his paper that others may find helpful. It is accepted at 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013).

A suffix array is a data structure that, together with the LCP array, allows solving many string processing problems in a very efficient fashion. In this article we introduce eGSA, the first external memory algorithm to construct both generalized suffix and LCP arrays for sets of strings. Our algorithm relies on a combination of buffers, induced sorting and a heap. Performance tests with real DNA sequence sets of size up to 8.5 GB showed that eGSA can indeed be applied to sets of large sequences with efficient running time on a low-cost machine. Compared to the algorithm that most closely resembles eGSA purpose, eSAIS, eGSA reduced the time spent to construct the arrays by a factor of 2.54.8.

Our previous threads on similar topic are -

On Constructing Suffix Arrays and LCP Arrays in External Memory

Construction of Suffix Array in External Memory Follow Up

Also, in this context, we like to mention another (old) paper that readers may enjoy. We believe we first saw it by digging through Jared Simpson’s SGA code.

Dynamic Extended Suffix Arrays

by M. Salson, T. Lecroq, M. Leonard, L. Mouchard.

In this article, we are presenting an algorithm that modifi es the suffix array and the Longest Common Prefix (LCP) array when the text is edited (insertion, substitution or deletion of a letter or a factor). This algorithm is based on a recent four-stage algorithm developed for dynamic Burrows- Wheeler Transforms (BWT). For minimizing the space complexity, we are sampling the Suffix Array, a technique used in BWT-based compressed indexes. We furthermore explain how this technique can be adapted for maintaining a sample of the Extended Suffix Array, containing a sample of the Suffix Array, a sample of the Inverse Suffix Array and the whole LCP array. Our practical experiments show that it operates very well in practice, being quicker than the fastest suffix array construction algorithm.

Slides:

Dynamic Burrows Wheeler Transform

‹»Ways to Find Homologous Genes« »How about a Chaotic Genome Assembler?«›