Readers may enjoy a new paper posted on bioRxiv by Ilia Minkin and Paul Medvedev. It presents SibeliaZ, a method for aligning multiple closely related genomes that is one or more orders of magnitude faster than competing approaches. Such a dramatic improvement in speed is not often seen in bioinformatics.
On the dataset consisting of 2 mice, SibeliaZ is more than 10 times faster than Cactus, while on 4 mice SibeliaZ is more than 20 times faster. On the datasets with 8 and 16 mice, SibeliaZ completed in under 8 and 12 hours, respectively, while Cactus did not finish (we terminated it after a week).
The authors use de Bruijn graphs to find anchor multiedge seeds and greedily extend them into collinear blocks. They then use these collinear blocks to speed up the alignment process. The algorithm is most effective when the genomes under consideration are closely related in the evolutionary sense.
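To give a feel for the idea, here is a toy sketch of anchoring and greedy extension. It is our own simplification and not the authors' implementation: it treats each k-mer as a de Bruijn graph edge, keeps only k-mers that occur exactly once in every genome as anchors, and chains them left to right. It ignores repeats and reverse-complement matches, which a real aligner must handle. The function names and the `max_gap` parameter are hypothetical.

```python
from collections import defaultdict


def kmer_positions(genomes, k):
    """Map each k-mer (an edge of the de Bruijn graph) to its (genome, position) hits."""
    occ = defaultdict(list)
    for g, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            occ[seq[i:i + k]].append((g, i))
    return occ


def find_anchors(genomes, k):
    """Return one position tuple per k-mer that occurs exactly once in every genome.

    Simplifying assumption: repeated k-mers are discarded rather than resolved.
    """
    n = len(genomes)
    anchors = []
    for hits in kmer_positions(genomes, k).values():
        if len(hits) == n and {g for g, _ in hits} == set(range(n)):
            anchors.append(tuple(pos for _, pos in sorted(hits)))
    return sorted(anchors)  # ordered by position in the first genome


def greedy_collinear_blocks(genomes, k, max_gap):
    """Greedily chain anchors that advance by at most max_gap in every genome.

    Returns a list of (starts, ends) pairs, one coordinate per genome,
    with half-open [start, end) intervals.
    """
    blocks, current = [], []
    for anchor in find_anchors(genomes, k):
        if current and all(0 < a - b <= max_gap for a, b in zip(anchor, current[-1])):
            current.append(anchor)  # same order in all genomes: extend the block
        else:
            if current:
                blocks.append(current)
            current = [anchor]  # order broken or gap too large: start a new block
    if current:
        blocks.append(current)
    return [(b[0], tuple(p + k for p in b[-1])) for b in blocks]


if __name__ == "__main__":
    # Two toy "genomes" whose halves are swapped relative to each other.
    print(greedy_collinear_blocks(["AACCGGTTGACT", "TTGACTAACCGG"], k=4, max_gap=2))
```

On the swapped example, the sketch reports two blocks, one per half. An aligner would then only need to align within each block rather than across the whole genomes, and that is where the speed-up comes from.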
Recently there has been a lot of activity on compact storage and representation of multiple closely related genomes, because it has become inexpensive to sequence many related organisms and then perform additional research on the collection. Interested readers may also take a look at the following papers: VARI, Rainbowfish, and Prefix-Free Parsing for Building Big BWTs. We will present a more complete survey in a later post.