Tutorials

Statistics for Matching Seeds

Most aligners used by bioinformaticians are not designed to efficiently handle large numbers of insertions and deletions. Large-scale analysis of PacBio reads was therefore difficult until Chaisson and Tesler developed a new alignment algorithm. How does the algorithm work? To explain that, we first need to discuss a few statistical properties of PacBio reads.

Ideally, an aligner for indel-heavy reads needs to start with a seed (an error-free sequence that matches the reference) and then build the alignment around the seed. However, the choice of seed length presents the programmer with a dilemma. If the seed is too long, we may not find any usable seed, given how error-prone PacBio reads are. If the seed is too short, it will have too many spurious hits in the reference. As a limiting case, a seed of length 1 (A, C, G or T) will match the reference genome everywhere.
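To put rough numbers on this dilemma, here is a minimal Python sketch. It is a back-of-the-envelope model, not the analysis from the BLASR paper. It assumes a reference made of uniformly random, independent bases (so a given seed of length k matches a given position with probability 1/4^k) and errors that hit each base of a read independently at a fixed rate. The function names, the 3-billion-base genome length and the 15% error rate are illustrative assumptions.

```python
def expected_spurious_hits(seed_len, genome_len):
    """Expected number of chance matches of one seed in a random genome.

    Assumes uniform, independent bases and counts one strand only.
    """
    positions = max(genome_len - seed_len + 1, 0)
    return positions / 4 ** seed_len


def prob_error_free(seed_len, error_rate):
    """Probability that a window of seed_len bases contains no error.

    Assumes every base is hit by an error independently at error_rate.
    """
    return (1.0 - error_rate) ** seed_len


if __name__ == "__main__":
    genome_len = 3_000_000_000  # assumed, roughly human-sized reference
    error_rate = 0.15           # assumed per-base error rate, illustration only

    print("seed_len  spurious_hits  P(error-free)")
    for k in (1, 8, 12, 16, 20, 25):
        hits = expected_spurious_hits(k, genome_len)
        p_ok = prob_error_free(k, error_rate)
        print(f"{k:8d}  {hits:13.3e}  {p_ok:13.4f}")
```

Under these assumptions the two columns move in opposite directions as the seed grows: chance hits fall by a factor of 4 per added base, while the probability that the window is error-free shrinks by a factor of (1 - error_rate) per added base. Real genomes contain repeats and real errors are not independent, so the numbers only indicate the shape of the trade-off.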

Chaisson and Tesler performed a statistical analysis to determine the optimal seed size. The above chart is from their BLASR paper, published in BMC Bioinformatics. It shows the distribution of the lengths of error-free stretches in reads from a PacBio library.