The error distribution in PacBio sequencing is unusual: it consists mostly of insertions and deletions, with only a few mismatches. On the plus side, the errors are spread fairly uniformly along each read, and they are uncorrelated between different reads covering the same genomic region. Therefore, the correct sequence can be recovered by comparing several reads of the same region. The difficulty lies in building a consensus in the presence of insertions and deletions; how that is done is discussed in later sections.
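A rough calculation illustrates why uncorrelated errors make this possible. The sketch below rests on a simplified model that is an assumption and not part of the text: the reads are taken to be already aligned column by column, each read is wrong at a given position independently with probability 0.15, and the consensus is a plain majority vote. The function name majority_wrong is hypothetical.

```python
from math import comb

def majority_wrong(n, p=0.15):
    """Probability that more than half of n independent reads are wrong
    at one position, under a simple binomial model (an assumption)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Print the probability for a few coverage depths.
for n in (1, 5, 15, 31):
    print(n, majority_wrong(n))
```

Under these assumptions the chance of a wrong majority falls quickly as coverage grows. The calculation ignores the hard part in practice, which is lining the reads up through their insertions and deletions.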
To make the point easy to remember, here are three properties of the error distribution in PacBio reads.
Property 1. The error rate in reads is usually around 15%, as reported in the BLASR paper by Chaisson and Tesler.
Property 2. The errors are distributed highly uniformly within each read. This is unlike other technologies, where errors tend to be concentrated near the ends of reads.
Property 3. That error rate of about 15% consists of roughly 11% insertions, 4% deletions and 1% mismatches.
The overwhelming presence of indels makes bioinformatic analysis of PacBio reads difficult.
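The three properties can be turned into a toy read simulator, which is handy for getting a feel for what such reads look like. This is a minimal sketch, not a model of the actual instrument: the function add_errors and its parameters are hypothetical, the default rates are the rounded figures from Property 3, every position is treated identically (Property 2), and each read is corrupted independently of the others.

```python
import random

# Assumed per-base rates, taken from the rounded figures in Property 3.
INSERTION, DELETION, MISMATCH = 0.11, 0.04, 0.01

def add_errors(seq, rng, insertion=INSERTION, deletion=DELETION, mismatch=MISMATCH):
    """Return a copy of seq with errors applied uniformly along its length."""
    out = []
    for base in seq:
        # Possibly insert a random base in front of the current one.
        if rng.random() < insertion:
            out.append(rng.choice("ACGT"))
        r = rng.random()
        if r < deletion:
            continue  # the base is deleted
        if r < deletion + mismatch:
            # Substitute a different base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Two simulated reads of the same region; their errors are independent.
reference = "ACGTTGCAAGGCTTACGATC"
rng = random.Random(0)
reads = [add_errors(reference, rng) for _ in range(2)]
print(reference)
for read in reads:
    print(read)
```

Because insertions outnumber deletions, simulated reads tend to come out slightly longer than the reference, and two reads of the same region rarely carry the same errors.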