In this week’s commentary in the membership section, we reviewed the recent advances in the genome assembly field. One paper mentioned there is an excellent PLOS Computational Biology review on scaffolding by Jay Ghurye and Mihai Pop. I will skip over the discussion on various long-read technologies and mention a topic with the potential to make a substantial improvement in genome assembly.
Ghurye and Pop argued that leveraging synteny between the genomes of different organisms had been a “missed opportunity” in the scaffolding field. I also experienced this gap while trying to assemble an electric fish genome. A couple of years back, I was working on the assembly of a second electric fish after the publication of the electric eel genome. I noticed a dearth of information in the published articles on genome assembly about using synteny to improve scaffolding quality.
Using synteny gets easier as we have access to more and more genomes. Moreover, as Ghurye and Pop argued, using synteny is the only approach that scales to the genome size, whereas the other technologies are limited by their respective read lengths. Then why do researchers working in the assembly field not pay attention to this method?
The reasons are historical. Firstly, genome assembly is a field dominated by computer scientists, and currently there is a big gulf between them and the biologists. Computer scientists like to treat genome assembly as a pure string manipulation problem with no interference from those “pesky biologists”. In the division of labor, the assembled genome is passed on to the biologists for annotation and other downstream analysis. This approach worked well in the past, because few genomes from other organisms were available for comparison. With the reduced cost of Illumina and PacBio sequencing, that sparse space is getting filled rapidly.
Secondly, due to financial reasons, the genome assembly field has been dominated by bioinformaticians working on the human genome. In fact, there has been active research on finding information from multiple human genomes, and we plan to review variation graphs and several other recent algorithmic developments in a commentary to be published soon in the membership section. However, once again, the computer scientists enjoy comparing multiple human genomes, because it can be translated into another string manipulation problem. In the case of using synteny between multiple organisms, the extra step of converting genes into proteins adds to the complexity.
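To make the idea concrete, here is a minimal sketch of what synteny-guided scaffolding can look like at its simplest. It is not the method from the Ghurye and Pop review or from any published tool. It assumes that genes on each contig have already been matched to orthologs in a related reference genome (for example through protein alignment), and the function name, the input tuple layout, and the sample data are all hypothetical. Each contig is assigned to the reference chromosome holding most of its anchors, given an orientation from the direction in which the anchors run, and ordered by the median anchor position.

```python
from collections import defaultdict
from statistics import median


def order_contigs_by_synteny(anchors):
    """Order and orient contigs using ortholog anchors in a reference genome.

    anchors: iterable of (contig, contig_pos, ref_chrom, ref_pos) tuples,
    one per gene on a contig that has an ortholog in the reference.
    (Hypothetical input layout, chosen for illustration only.)
    Returns {ref_chrom: [(contig, orientation), ...]} in reference order.
    """
    by_contig = defaultdict(list)
    for contig, contig_pos, ref_chrom, ref_pos in anchors:
        by_contig[contig].append((contig_pos, ref_chrom, ref_pos))

    placements = defaultdict(list)
    for contig, hits in by_contig.items():
        # Assign the contig to the chromosome with the most anchors
        # (ties broken alphabetically so the result is deterministic).
        counts = defaultdict(int)
        for _, chrom, _ in hits:
            counts[chrom] += 1
        best_chrom = max(sorted(counts), key=counts.get)

        # Walk the anchors in contig order and look at the reference positions.
        on_chrom = sorted((cp, rp) for cp, chrom, rp in hits if chrom == best_chrom)
        ref_positions = [rp for _, rp in on_chrom]

        # Orientation: do reference positions rise or fall along the contig?
        if len(ref_positions) < 2:
            orientation = "?"  # a single anchor cannot orient a contig
        elif ref_positions[-1] >= ref_positions[0]:
            orientation = "+"
        else:
            orientation = "-"

        placements[best_chrom].append((median(ref_positions), contig, orientation))

    # Sort the contigs on each chromosome by their median anchor position.
    return {
        chrom: [(contig, orientation) for _, contig, orientation in sorted(items)]
        for chrom, items in placements.items()
    }


# Made-up anchors for three contigs.
example_anchors = [
    ("c1", 100, "chr1", 5000),
    ("c1", 900, "chr1", 5800),
    ("c2", 50, "chr1", 1200),
    ("c2", 700, "chr1", 400),
    ("c3", 10, "chr2", 300),
]
scaffolds = order_contigs_by_synteny(example_anchors)
```

A real pipeline would also have to deal with rearrangements between the two species, paralogs, and conflicting anchors, which is where the biological knowledge comes in.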
Rewards in Using Synteny
Synteny information is not only useful for improving the quality of assembly, but may also turn out to be rewarding in the discovery of new biology. The earliest example of this came from a 2004 Nature paper by Kellis et al. titled “Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae”. The authors discovered a major mechanism for the creation of new genes in the yeast lineages.
Bioinformatics Beyond Genome Assembly
Beyond genome assembly, the multi-organism comparative approach is even less popular in the bioinformatics tools developed by statisticians and computer scientists. Take RNAseq for example. In our electric eel paper, we used RNAseq comparison of multiple evolutionarily distant electric fish species to find shared pathways. Much of the analysis had to be done manually, because the standard statistical tools were not equipped to recognize statistically significant patterns across multiple organisms.