Error Distribution in PacBio Reads

Sequencing instruments from Pacific Biosciences generate long reads, but the reads contain a considerable amount of noise.

The error distribution in PacBio sequencing is unusual because it consists mostly of insertions and deletions, with only a few mismatches. On the plus side, the errors are distributed close to uniformly along each read, and they are uncorrelated between different reads covering the same genomic region. Therefore, it is possible to compare overlapping reads to recover the correct sequence. The difficulty lies in finding a consensus through insertions and deletions; how that is done is discussed in later sections.
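To see why uncorrelated errors make a consensus possible, here is a minimal sketch in Python. It relies on a strong simplification that is not part of the text above: each read is treated as simply "right" or "wrong" at a given position, independently of the other reads, with an assumed probability of 0.15 of being wrong. Under that assumption, the chance that a strict majority of reads is wrong at the same position is a binomial tail. The function name `majority_error_probability` is chosen here for illustration only.

```python
from math import comb


def majority_error_probability(n_reads: int, per_read_error: float) -> float:
    """Probability that a strict majority of n_reads is wrong at one position.

    Assumes (as a simplification) that every read is wrong independently
    with the same probability per_read_error.
    """
    threshold = n_reads // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_reads, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_reads - k)
        for k in range(threshold, n_reads + 1)
    )


# Assumed per-read error probability of 0.15, for illustration only.
for n in (1, 5, 11, 21):
    print(n, majority_error_probability(n, 0.15))
```

The probability drops quickly as coverage grows, which is the intuition behind comparing reads. Real PacBio errors are mostly indels, so the reads must first be aligned to one another before any such vote makes sense, and this sketch does not attempt that step.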

To make sure the point is remembered, we write down three properties of the error distribution in PacBio reads.

Property 1. The error rate in reads is usually around 15%, as reported in the BLASR paper by Chaisson and Tesler.

Property 2. The errors are distributed nearly uniformly within the reads. This is unlike other technologies, where errors tend to concentrate near the ends of reads.

Property 3. The roughly 15% of errors consist of approximately 11% insertions, 4% deletions and 1% mismatches.

The overwhelming presence of indels makes bioinformatic analysis of PacBio reads difficult.
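The three properties can be turned into a toy read simulator. The sketch below is only an illustration under stated assumptions: the per-base rates are the approximate figures from Property 3, the same rates are applied at every position (Property 2), and the random reference sequence and the helper name `simulate_read` are made up for this example. It is not a model of the actual sequencing chemistry.

```python
import random

# Assumed per-base rates, taken from the approximate figures in Property 3.
INSERTION_RATE = 0.11
DELETION_RATE = 0.04
MISMATCH_RATE = 0.01
BASES = "ACGT"


def simulate_read(reference, rng):
    """Return a noisy copy of reference and a count of each error type.

    The same rates apply at every position, so errors are spread
    uniformly along the read.
    """
    read = []
    counts = {"insertion": 0, "deletion": 0, "mismatch": 0}
    for base in reference:
        r = rng.random()
        if r < INSERTION_RATE:
            # Insert one random extra base, then keep the true base.
            read.append(rng.choice(BASES))
            read.append(base)
            counts["insertion"] += 1
        elif r < INSERTION_RATE + DELETION_RATE:
            # Drop the base entirely.
            counts["deletion"] += 1
        elif r < INSERTION_RATE + DELETION_RATE + MISMATCH_RATE:
            # Replace the base with a different one.
            read.append(rng.choice([b for b in BASES if b != base]))
            counts["mismatch"] += 1
        else:
            read.append(base)
    return "".join(read), counts


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "".join(rng.choice(BASES) for _ in range(10_000))
read, counts = simulate_read(reference, rng)

for kind, n in counts.items():
    print(f"{kind}: {n / len(reference):.3f}")
print("reference length:", len(reference), "read length:", len(read))
```

Because insertions outnumber deletions, the simulated read comes out longer than the reference, and no position in the read can be assumed to line up with the same position in the reference. That shifting of coordinates is what makes indel-heavy reads harder to analyze than reads with mismatches only.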

The discussions that follow are based on Illumina reads, because they are available in high volume.