Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

Burrows Wheeler Transform

Note: You can View Burrows Wheeler Transform in Animation <A href=http://www.homolog.us/blogs/blog/2012/04/03/burrows-wheeler-transform-in-animation/>here</A>.


Burrows Wheeler Transform (BWT) is a reversible transform. It starts with a sequence or other type of text, and turns it into something else. BWT is useful, because the transformed sequence can be compressed far more easily than the original sequence. The popular bzip2 algorithm for file compression uses this transform. By the same token, transformed sequence can be loaded with much less RAM than the original sequence. This is important for mapping of RNAseq reads, because one can load the transformed human genome into the RAM of a laptop and do search for millions of Illumina reads.

How is the transform done?

Let us consider the word ‘homolog.us$’. We added the character ‘$’ at the end, because inverse BWT is reversible only up to giving the circular sequence and does not give the begin and end of original sequence. With ‘$’ marking the end of the sequence, it becomes trivial to identify the begin and end from the circular sequence.

To transform ‘homolog.us$’, the same word is written 11 times with one letter sent from front to back in each iteration.

Figure

Please note that each row and each column has all characters of the word ‘homolog.us$’ by way of construction.

In the following step, all 11 lines (rows) in the above matrix are sorted alphabetically. The last line (‘$homolog.us’) starts with ‘$’ and moves to the top. Then comes (‘.us$homolog’) and so on. The sorted matrix is shown below, and you can see that the characters in the first column follow alphabetic order as it should be with any sort. The lines with identical characters in the first place (‘og.us$homol’, ‘olog.us$hom’ and ‘omolog.us$h’) got sorted on the basis of second characters. Do make a mental note of the above point, because it is important in understanding LF mapping property explained below.

Figure

After sorting is done, last column (‘sg$oolmhu.’) is called the BWT transform of the original sequence (‘homolog.us$’). Note again that each column always has all characters of ‘homolog.us$’, because sorting does not change the columns but only rearranges them. Also note that the first column is a sorted list of all characters in ‘homolog.us$’.


Web Statistics