Burrow Wheeler Transform - Matlab Code, Animation

Burrow Wheeler Transform - Matlab Code, Animation


A very helpful Matlab code written by Michael Schatz got buried within our daily compilation of bioinformatics news, even though it deserved a full commentary.

Several months back, we wrote a number of commentaries explaining how Burrows Wheeler Transform works, and how it was being used in very fast nucleotide search algorithms (Bowtie, BWT, etc.). You can check them here -

Finding us in homolog.us

Finding us in homolog.us part II

Finding us in homolog.us III (Search Algorithm and Indexing)

We also developed simple animations to show the transform process -

Burrows Wheeler Transform in Animation

The only missing part was a simple code and Michael Schatz’s Matlab code fits in very well. We modified his code to work for two examples in our earlier commentaries - ‘finding US in HOMOLOG.US’ and ‘finding TAT in TATATATATA’. If you do not have Matlab, you can download open-source Octave to run the following codes.

Matlab/Octave code for finding US in HOMOLOG.US


`

%% BWT construction and exact match

%% Michael Schatz (mschatz@cshl.edu)

%%

%% Here is my most basic / easiest to understand solution. It requires O(n^2) space,

%% O(n lg n) time to construct the BWT by sorting the strings with sort(), and O(nm)

%% time to match a query of length m. It is relatively slow and verbose but shows how

%% the algorithm basically works.

%%

%% (minor modification by homolog.us)

%%

clear all

%% We should find a match of qry in genome at 3 and 4

genome=’HOMOLOG.US’;

qry=’US’

%% initialize a few helper variables

genomedollar=strcat(genome, ‘$’);

l=length(genomedollar);

%% create a table of all cyclic permutations

permutations={};

for i=1:l

permutations{i} = strcat(genomedollar(i:l),genomedollar(1:i-1));

end

%% uncomment to print the permutations

%% permutations

%% sort the cyclic permutations to form the bwm

bwm = sort(permutations);

bwm

%disp(‘BWM’);

%for i=1:l

% disp(bwm(i));

%end

%% initialize the bwt, and extract the last column of the bwm

bwt = zeros(1,1);

for i=1:l

s=bwm{i};

bwt(1,i) = s(l:l);

end

bwt=char(bwt);

%% print the bwt

bwt

%% Before we can search the BWT, count occurrences of each character

%% First create a lookup table so that we can quickly jump from

%% a character ($ACGT) to a position in the table (12345)

bases=’$.GHLMOSU’;

char2baseidx=zeros(1,255);

blen=length(bases);

for i=1:blen

c=bases(i:i);

char2baseidx(c)=i;

end

%% now count occurences

basecnt=zeros(1,blen);

for i=1:l

c=bwt(i:i);

ci=char2baseidx(c);

basecnt(ci) = basecnt(ci)+1;

end

rowstart=zeros(1,blen)+1;

for i=2:blen

rowstart(i)=rowstart(i-1)+basecnt(i-1);

end

bases

basecnt

rowstart

%% Note we can reconstruct the first column of the BWM from basecnt

%% And we have the last column from the BWT

p=1;

for i=1:blen

for j = 1:basecnt(i)

disp([bases(i), ‘…’, bwt(p:p)]);

p = p+1;

end

end

%% Now look for a query match using LFc

qlen=length(qry);

%% Initialize that the query can be located anywhere in the genome

top=1;

bot=l+1;

%% process the query in reverse order

for i=qlen:-1:1

qc=qry(i);

disp([‘Processing i=’, num2str(i), ‘ qc=’, qc, ‘ top=’, num2str(top), ‘ bot=’, num2str(bot)]);

%% if bwt(top) was qc, what would be its rank?

ranktop=1;

for j=1:top-1

if(bwt(j:j)==qc)

ranktop = ranktop+1;

end

end

%% if bwt(bot) was qc, what would be its rank?

rankbot = 1;

for j=1:bot-1

if(bwt(j:j)==qc)

rankbot = rankbot+1;

end

end

disp([’ ranktop=’, num2str(ranktop), ‘ rankbot=’, num2str(rankbot)])

%% shift the top and bottom pointers

top=rowstart(char2baseidx(qc))+ranktop-1;

bot=rowstart(char2baseidx(qc))+rankbot-1;

disp([‘top=’, num2str(top), ‘ bot=’, num2str(bot)]);

end

disp([‘found exact matches at top=’, num2str(top), ‘ bot=’, num2str(bot)]);

%% Now print the coordinate in the original genome

%% For each row in the range,

%% iteratively apply the LF until we reach

%% the beginning of the string ($)

for row=top:bot-1

pos = 1;

currow = row;

qc = bwt(currow:currow);

disp([‘looking for occurrence at row=’, num2str(currow), ‘ qc=’, qc]);

while(qc ~= ‘$’)

pos = pos+1;

%% count the rank of qc in the row

rowrank=1;

for j=1:currow-1

if (bwt(j)==qc)

rowrank=rowrank+1

end

end

%% now jump to that row

currow=rowstart(char2baseidx(qc))+rowrank-1;

qc = bwt(currow:currow);

disp([’ jumping to currow=’, num2str(currow), ‘ rowrank=’, num2str(rowrank), ‘ qc=’, qc]);

end

disp([‘The occurrence at row=’, num2str(row), ‘ is at pos=’, num2str(pos)]);

end

`


Matlab/Octave code for finding TAT in TATATATATA


`

%% BWT construction and exact match

%% Michael Schatz (mschatz@cshl.edu)

%%

%% Here is my most basic / easiest to understand solution. It requires O(n^2) space,

%% O(n lg n) time to construct the BWT by sorting the strings with sort(), and O(nm)

%% time to match a query of length m. It is relatively slow and verbose but shows how

%% the algorithm basically works.

%%

%% (minor modification by homolog.us)

%%

clear all

%% We should find a match of qry in genome at 3 and 4

genome=’TATATATATA’;

qry=’TAT’

%% initialize a few helper variables

genomedollar=strcat(genome, ‘$’);

l=length(genomedollar);

%% create a table of all cyclic permutations

permutations={};

for i=1:l

permutations{i} = strcat(genomedollar(i:l),genomedollar(1:i-1));

end

%% uncomment to print the permutations

%% permutations

%% sort the cyclic permutations to form the bwm

bwm = sort(permutations);

bwm

%disp(‘BWM’);

%for i=1:l

% disp(bwm(i));

%end

%% initialize the bwt, and extract the last column of the bwm

bwt = zeros(1,1);

for i=1:l

s=bwm{i};

bwt(1,i) = s(l:l);

end

bwt=char(bwt);

%% print the bwt

bwt

%% Before we can search the BWT, count occurrences of each character

%% First create a lookup table so that we can quickly jump from

%% a character ($ACGT) to a position in the table (12345)

bases=’$ACGT’;

char2baseidx=zeros(1,255);

blen=length(bases);

for i=1:blen

c=bases(i:i);

char2baseidx(c)=i;

end

%% now count occurences

basecnt=zeros(1,blen);

for i=1:l

c=bwt(i:i);

ci=char2baseidx(c);

basecnt(ci) = basecnt(ci)+1;

end

rowstart=zeros(1,blen)+1;

for i=2:blen

rowstart(i)=rowstart(i-1)+basecnt(i-1);

end

bases

basecnt

rowstart

%% Note we can reconstruct the first column of the BWM from basecnt

%% And we have the last column from the BWT

p=1;

for i=1:blen

for j = 1:basecnt(i)

disp([bases(i), ‘…’, bwt(p:p)]);

p = p+1;

end

end

%% Now look for a query match using LFc

qlen=length(qry);

%% Initialize that the query can be located anywhere in the genome

top=1;

bot=l+1;

%% process the query in reverse order

for i=qlen:-1:1

qc=qry(i);

disp([‘Processing i=’, num2str(i), ‘ qc=’, qc, ‘ top=’, num2str(top), ‘ bot=’, num2str(bot)]);

%% if bwt(top) was qc, what would be its rank?

ranktop=1;

for j=1:top-1

if(bwt(j:j)==qc)

ranktop = ranktop+1;

end

end

%% if bwt(bot) was qc, what would be its rank?

rankbot = 1;

for j=1:bot-1

if(bwt(j:j)==qc)

rankbot = rankbot+1;

end

end

disp([’ ranktop=’, num2str(ranktop), ‘ rankbot=’, num2str(rankbot)])

%% shift the top and bottom pointers

top=rowstart(char2baseidx(qc))+ranktop-1;

bot=rowstart(char2baseidx(qc))+rankbot-1;

disp([‘top=’, num2str(top), ‘ bot=’, num2str(bot)]);

end

disp([‘found exact matches at top=’, num2str(top), ‘ bot=’, num2str(bot)]);

%% Now print the coordinate in the original genome

%% For each row in the range,

%% iteratively apply the LF until we reach

%% the beginning of the string ($)

for row=top:bot-1

pos = 1;

currow = row;

qc = bwt(currow:currow);

disp([‘looking for occurrence at row=’, num2str(currow), ‘ qc=’, qc]);

while(qc ~= ‘$’)

pos = pos+1;

%% count the rank of qc in the row

rowrank=1;

for j=1:currow-1

if (bwt(j)==qc)

rowrank=rowrank+1;

end

end

%% now jump to that row

currow=rowstart(char2baseidx(qc))+rowrank-1;

qc = bwt(currow:currow);

%disp([’ jumping to currow=’, num2str(currow), ‘ rowrank=’, num2str(rowrank), ‘ qc=’, qc]);

end

disp([‘The occurrence at row=’, num2str(row), ‘ is at pos=’, num2str(pos)]);

end

`


Enjoy !!



Written by M. //