Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

Lyndon Word and Lyndon Factorization

We came across a very interesting paper, but before we get to it, let us define few terms with help from wikipedia.

Lyndon Word

In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a string that is strictly smaller in lexicographic order than all of its rotations. …

Connection to de Bruijn sequences

If one concatenates together, in lexicographic order, all the Lyndon words that have length dividing a given number n, the result is a de Bruijn sequence, a circular sequence of symbols such that each possible length-n sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is 0 0001 0011 01 0111 1 This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in linear time and logarithmic space.

Monoid factorisation and Chen–Fox–Lyndon theorem

A Lyndon word over a totally ordered alphabet A is a word which is lexicographically less than all its rotations.[1] The Chen–Fox–Lyndon theorem states that every string may be formed in a unique way by concatenating a non-increasing sequence of Lyndon words. Hence taking Xl to be the singleton set {l} for each Lyndon word l, with the index set L of Lyndon words ordered lexicographically, we obtain a factorisation of A*.[2] Such a factorisation can be found in linear time.

Now, we can discuss the paper. Giovanna Rosone, one of the authors, is a contributor to BCR algorithm that we covered earlier, when Heng Li brought them to our attention in a comment. Please enjoy the beautiful math of the new paper.

Sorting suffixes of a text via its Lyndon Factorization

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can be adapted to different implementative scenarios.

“The goal of this paper is to introduce a new strategy for the sorting of the suffixes of a word that opens new scenarios of the computation of the SA and the bwt.”

The paper presents an elegant way to construct the BWT and Suffix array of a text.

The goal of this paper is to propose a new strategy to compute the bwt and the SA of a text by decomposing it into Lyndon factors and by using the compatibility relation between the sorting of its local and global suffixes.

Lyndon Word

You start with a word and take all its rotated forms. The lexicographically smallest one is the Lyndon word.

LYNDON YNDONL NDONLY DONLYN ONLYND NLYNDO

Which one the smallest among above six? It is DONLYN.

Alternately, a Lyndon word has the property that, whenever it is split into two substrings, the left substring is always lexicographically less than the right substring.

DONLYN also passes the above test. It is a Lyndon word.

Please note that words like ATATATAT are not Lyndon words. Periodic words are not allowed and Lexicographically smallest one has to be uniquely defined.

The concept of Lyndon word was proposed by mathematician Roger Lyndon in 1954.

Lyndon Factorization

Lyndon showed in 1958 that any long text could be broken into a series of Lyndon words.

Every word W ∈ + has a unique factorization W = L_1 L_2 · · ·L_k such that L_1 ≥lex · · · ≥lex L_k is a non-increasing sequence of Lyndon words.

However, it took until 1983, when Duval presented a linear time algorithm to perform the above factorization for any given word.

One key aspect of Duval’s method is that the program can start from the left of a word and get the Lyndon factorizations one by one until reaching the right end. Why that is important will be clear in the next section.

BWT and Suffix Array Through Lyndon Factorization

Mantaci, Restivo, Rosone and Sciortino paper presents a highly parallelizable way of computing BWT and SA by using Lyndon factorization.

This proposition suggests a possible strategy for sorting the list of the suffixes of some word W: – find the Lyndon decomposition of W, L_1L_2 · · ·L_k; – find the sorted list of the suffixes of L_1 and, separately, the sorted list of the suffixes of L_2; – merge the sorted lists in order to obtain the sorted lists of the suffixes of L_1L_2; – find the sorted list of the suffixes of L_3 and merge it to the previous sorted list; – keep on this way until all the Lyndon factors are processed;

You can now guess why Duval’s method is helpful. You can have multiple parallel threads running for BWT computation, with one set finding the Lyndon factorization and other set computing the BWT on the already factorized part of the text. The BWT computation method through merging of parts is explained in the following example.

Let W = aabcabbaabaabdabbaaabbdc. The Lyndon factorization of W is L_1L_2L_3, where L_1 = aabcabb > L_2 = aabaabdabb > L_3 = aaabbdc.

Capture

What is going on? BWT and SA computation of the first Lyndon word (L_1=aabcabb) is trivial. For the second one, 7 is being added to each Suffix array location, 7 being the size of L_1. Additionally, the G array describes how to insert the SA of L_2 within the SA of L_1 to get the SA of L_1 L_2. The process is continued until all Lyndon factorizations are incorporated.

We are continuing to read the paper. If we have any more insight, we will add that below.


Web Statistics