Bayesian Genome Assembly and Assessment by Markov Chain Monte Carlo Sampling

Although the 5,386 bp bacteriophage genome is minuscule, readers may find the concepts presented in the following paper useful.

Most genome assemblers construct point estimates, choosing only a single genome sequence from among many alternative hypotheses that are supported by the data. We present a Markov chain Monte Carlo approach to sequence assembly that instead generates distributions of assembly hypotheses with posterior probabilities, providing an explicit statistical framework for evaluating alternative hypotheses and assessing assembly uncertainty. We implement this approach in a prototype assembler, called Genome Assembly by Bayesian Inference (GABI), and illustrate its application to the bacteriophage ΦX174. Our sampling strategy achieves both good mixing and convergence on Illumina test data for ΦX174, demonstrating the feasibility of our approach. We summarize the posterior distribution of assembly hypotheses generated by GABI as a majority-rule consensus assembly. Then we compare the posterior distribution to external assemblies of the same test data, and annotate those assemblies by assigning posterior probabilities to features that are in common with GABI's assembly graph. GABI is freely available under a GPL license from https://bitbucket.org/mhowison/gabi.
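To make the idea of sampling assembly hypotheses and summarizing them by majority rule concrete, here is a minimal toy sketch. It is not GABI's sampler: the three hypotheses, their weights, the edge names, and all function names are invented for illustration, and the proposal is a simple uniform draw rather than anything tuned for real assembly graphs.

```python
import math
import random
from collections import Counter

def metropolis_sample(log_post, hypotheses, n_steps, seed=0):
    """Metropolis sampling over a small discrete set of hypotheses.

    log_post maps each hypothesis to an unnormalized log posterior.
    The proposal is uniform over all hypotheses, hence symmetric.
    """
    rng = random.Random(seed)
    current = hypotheses[0]
    samples = []
    for _ in range(n_steps):
        proposal = rng.choice(hypotheses)
        # Accept with probability min(1, p(proposal) / p(current)).
        log_ratio = log_post[proposal] - log_post[current]
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        samples.append(current)
    return samples

def feature_posteriors(samples, features_of):
    """Posterior probability of a feature = fraction of samples containing it."""
    counts = Counter()
    for hypothesis in samples:
        counts.update(features_of[hypothesis])
    return {feature: n / len(samples) for feature, n in counts.items()}

def majority_rule(posteriors, threshold=0.5):
    """Keep only the features supported by more than half of the samples."""
    return {f for f, p in posteriors.items() if p > threshold}

# Hypothetical toy input: three assembly hypotheses with unnormalized
# posterior weights 6 : 3 : 1, each described by the graph edges it uses.
log_post = {"A": math.log(6), "B": math.log(3), "C": math.log(1)}
features_of = {
    "A": {"e1", "e2"},
    "B": {"e1", "e3"},
    "C": {"e3", "e4"},
}

samples = metropolis_sample(log_post, ["A", "B", "C"], n_steps=20000)
posteriors = feature_posteriors(samples, features_of)
consensus = majority_rule(posteriors)
```

In this toy setup the exact feature posteriors are 0.9 for e1, 0.6 for e2, 0.4 for e3, and 0.1 for e4, so the sampled estimates should land near those values. The same per-feature frequencies are what one would use to annotate an external assembly: each feature it shares with the sampled hypotheses receives the fraction of samples in which that feature appears.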

In this context, an earlier paper by Rahman and Pachter is also worth reading.

CGAL: computing genome assembly likelihoods

Assembly algorithms have been extensively benchmarked using simulated data so that results can be compared to ground truth. However, in de novo assembly, only crude metrics such as contig number and size are typically used to evaluate assembly quality. We present CGAL, a novel likelihood-based approach to assembly assessment in the absence of a ground truth. We show that likelihood is more accurate than other metrics currently used for evaluating assemblies, and describe its application to the optimization and comparison of assembly algorithms. Our methods are implemented in software that is freely available at http://bio.math.berkeley.edu/cgal/.
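As a rough illustration of what scoring an assembly by likelihood means, the sketch below compares two candidate sequences by the log-likelihood of a handful of reads. It is a deliberately simplified model and not CGAL's: it assumes forward-strand reads only, a uniform start position, and a fixed per-base substitution error rate, and it ignores paired-end insert sizes and coverage bias. The sequences, reads, and function names are all made up for the example.

```python
import math

def read_likelihood(read, assembly, error_rate=0.01):
    """P(read | assembly), averaging over all start positions (uniform prior)."""
    n_positions = len(assembly) - len(read) + 1
    if n_positions <= 0:
        return 0.0
    total = 0.0
    for start in range(n_positions):
        window = assembly[start:start + len(read)]
        p = 1.0
        for read_base, asm_base in zip(read, window):
            # A match has probability 1 - e; a mismatch is one of 3 other bases.
            p *= (1 - error_rate) if read_base == asm_base else error_rate / 3
        total += p
    return total / n_positions

def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-likelihoods, treating reads as independent."""
    log_lik = 0.0
    for read in reads:
        lik = read_likelihood(read, assembly, error_rate)
        if lik == 0.0:
            return float("-inf")  # a read that cannot come from this assembly
        log_lik += math.log(lik)
    return log_lik

# Hypothetical toy data: candidate_b differs from candidate_a by one base.
candidate_a = "ACGTTGCAAGCT"
candidate_b = "ACGTAGCAAGCT"
reads = ["ACGTTG", "GTTGCA", "TGCAAG", "CAAGCT"]

ll_a = assembly_log_likelihood(reads, candidate_a)
ll_b = assembly_log_likelihood(reads, candidate_b)
better = "candidate_a" if ll_a > ll_b else "candidate_b"
```

No reference genome is needed for this comparison: the candidate that explains the reads better gets the higher log-likelihood, which is the sense in which likelihood can stand in for ground truth.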

We should also note that advanced assemblers such as SPAdes have built-in statistical evaluation methods for the scaffolding step, which is usually the most error-prone.
