Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

BLASR

The error distribution in PacBio sequencing is unusual, because it consists mostly of insertions and deletions and only a few mismatches. On the plus side, the errors are always uniformly distributed and they are uncorrelated between two different sequencing reads representing the same genomic region. Therefore, it is possible to compare the reads to come up with the correct sequence. The difficulty lies in finding consensus through insertions and deletions, and how that is done will be discussed in the later sections.

To make sure the point is remembered, we will write down three properties of error distribution in PacBio reads.

Property 1. The error rate in reads is usually around 15%. It has been reported in BLASR paper by Chaisson and Tesler.

Property 2. The errors are highly uniform within the reads. This is unlike other technologies, where a preponderance of errors near the ends of reads is seen.

Property 3. Those 15% of errors constitute of 11% insertions, 4% deletions and 1% mismatches.

Overwhelming presence of indels makes bioinformatic analysis of PacBio reads difficult.

<A href=http://www.ncbi.nlm.nih.gov/pubmed/23129296>A paper by Y. Ono et al.</A> reported developing a simulator for PacBio reads. The simulator PBSIM is publicly available from google repository (<A href=http://code.google.com/p/pbsim/>http://code.google.com/p/pbsim/</A>) under the GNU GPL v2 license.

Most aligners used by bioinformaticians are not designed to efficiently handle extensive amounnt of insertion and deletions. Therefore large-scale analysis of PacBio reads were difficult until Chaisson and Tesler developed a new alignment algorithm. How does the algorithm work? To explain that, we need to first discuss few statistical properties of PacBio reads.

Ideally, an aligner for indel-heavy reads need to start with a seed (an error-free sequence that matches the reference) and then build the alignment around the seed. However, choice of seed length presents the programmer with a dilemma. If the seed is too long, we may not find any appropriate seed given how error-prone the PacBio reads are. If the seed is too short, it will have too many spurious hits with the referece. As a limiting case, a seed of length 1 (A, C, G, T) will have matches with the reference genome everywhere.

Figure

Chaisson and Tesler performed some statistical analysis to come up with the optimum size of the seed. The above chart is from their BLASR paper published in BMC Bioinformatics. It shows the distribution of sizes of error-free reads in a PacBio library.

Based on their analysis, Chaisson and Tesler came up with 15nt as the optimum size of the seed. They noticed another pattern in their data. In regions representing correct alignment of a PacBio read with the reference sequence, multiple 15nt seeds could be found in near proximity.

As a first step, BLASR compares a PacBio read with a reference sequence and searches for matching 15-mers. Given that we are looking for exact matches, the search can be done efficiently by using Burrows-Wheeler transform.

Due to the presence of repeats in a genome, the above step is likely to give too many seeds from various incorrect regions. To reduce the list, BLASR picks the regions with multiple matching 15-mers seeds. More specifically, BLASR finds regions with at least 10 anchors of size > 15nt, and then further aligns reads with those genomic regions using a more precise alignment method (e.g. Smith-Waterman).

Can BLASR be used to align genome of one organism with another? Please keep in mind that BLASR strategy makes use of the fact that error rate in PacBio reads is uniform. When we compare two genomes, the distribution of errors changes from one segment to another. Intergenic regions have higher degree of variation than the genomic or cis-regulatory regions.


Web Statistics