For those unfamiliar with PacBio reads, the above figure shows a Clustal alignment of two reads with a genomic region that has already been assembled (marked as REAL). Please click on it to see in full.
Three points to make note of -
(i) PacBio reads mostly contain indel errors as we discussed earlier,
(ii) The random positions of indels in different PacBio reads are location- independent,
(iii) You will also notice long stretches of error-free regions. Mark Chaisson, who wrote BLASR, mentioned that in his talks he often shows two random series. One of them is generated by tossing coins many times and the other one is generated by having very few contiguous stretches of heads and tails compared to the one generated by coin toss. When he asks which one is generated through coin toss, people often identify the other one, because their expectation of how a random sequence should look like is quite different from an actual random sequence.
What can we do with these reads?
Main advantange of PacBio is the length of reads. They are often greater than 3k-4k and are getting longer with RS II. If you are assembling genome, you can use them to scaffold the Illumina contigs.
Secondly, if your PacBio coverage is high, you can align PacBio against PacBio and take consensus to assemble a genome from PacBio alone. That is when the completely random distribution of indels comes handy.