Why is BWT useful?
If you transform a very long sequence, BWT brings similar characters together and makes it more efficient to compress them. For example, you see that in the transformed sequence, two ‘o’s are brought together and the third ‘o’ is nearby.
Second property of BWT is useful for performing nucleotide searches. It is called last-to-front mapping (LF mapping). To explain it, we perform the same transform as above, but mark three ‘o’s in different colors.
Here is the original list -
Here is the sorted list -
LF mapping property says that the order of nucleotides in the first column is the same as order of nucleotides in the last column even up to the colors of nucleotides. You can see that three ‘o’s in the first column are sorted as blue, green and red. In the last column, which is the BWT transform of the original sequence, blue ‘o’ came first, green ‘o’ came second and red ‘o’ came third. Please check these slides, if you want to understand LF mapping property mathematically.
There is nothing magical about LF mapping property and you can understand it by just thinking through the sorting process. Three ‘o’s of the last column are followed by characters ‘g’, ‘l’ and ‘m’ in the first column. They are sorted with ‘g’ followed by ‘l’ followed by ‘m’, because first column sorts the characters.
Three ‘o’s of the first column are followed by the same ‘g’, ‘l’ and ‘m’ in the second column. When those lines are sorted, the first column is the same (‘o’), and therefore the sorting order is decided by the second column.
The implications of LF mapping property are profound, as we shall see in the next commentary describing Bowtie search algorithm.
Let us consider another example - TATATATATA$. Just like before, you need to write the word many times with one letter shifted to the back until all letters have been shifted. The second table is obtained by sorting the words within first table alphabetically. The last column gives the Burrows Wheeler transform of the original sequence. Again you see that all Ts and all but one As are clustered together in the transformed sequences.
In the above tables, all As and Ts of the original sequence have different colors so that one can track where they ended in the final sequence. You can see that the color-ordering of As in the last column is the same as the color ordering of As in the first column. Ts also show similar property known as LF mapping property.