Tutorials



File Format

The sequencer generates the following files (X stands for an arbitrary file name):

Top_folder:
    X.mcd.hd5
    X.metadata.xml

Analysis_Results:
    X-02.log
    X-03.log
    X-04.log
    X.bas.h5
    X.ccs.fasta
    X.ccs.fastq
    X.fasta
    X.fastq
    X.sts.csv
    X.sts.xml

What do they contain? The top folder contains the raw data in hd5 (HDF5) format, which is PacBio’s native data format. The sequencer also runs a preliminary analysis, and the results are stored in the Analysis_Results folder.
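As a quick sanity check on a finished run, the listing above can be turned into a small script that reports which expected files are absent. This is only a sketch: the function name missing_files is made up for illustration, the suffixes are copied from the listing (the numbered log files are left out), and it assumes that Analysis_Results sits directly inside the top folder, which may not match your setup.

```python
from pathlib import Path

# Suffixes copied from the file listing; the layout is an assumption.
TOP_SUFFIXES = [".mcd.hd5", ".metadata.xml"]
ANALYSIS_SUFFIXES = [
    ".bas.h5", ".ccs.fasta", ".ccs.fastq",
    ".fasta", ".fastq", ".sts.csv", ".sts.xml",
]


def missing_files(run_dir, prefix):
    """Return the expected files (relative paths) that are not present.

    run_dir is the top folder of the run, prefix is the "X" part of the
    file names. Assumes Analysis_Results is a subfolder of run_dir.
    """
    run_dir = Path(run_dir)
    expected = [run_dir / (prefix + suffix) for suffix in TOP_SUFFIXES]
    expected += [
        run_dir / "Analysis_Results" / (prefix + suffix)
        for suffix in ANALYSIS_SUFFIXES
    ]
    return [
        path.relative_to(run_dir).as_posix()
        for path in expected
        if not path.is_file()
    ]
```

Calling missing_files with the run folder and the file name prefix returns an empty list when every file from the listing is in place.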

One important file in the Analysis_Results folder is X.ccs.fasta. This is the file to look at first to gain some confidence in the generated data. It contains the cleanest assembled reads from the sequencing run. If you have a reference genome, first align the ccs reads to it. If that step fails, something went wrong with the sequencing run.
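Before aligning anything, it can help to simply count the reads in the ccs file and look at their lengths. The following is a minimal sketch using only the Python standard library; the function names fasta_lengths and summarize are made up for illustration, and the code assumes an ordinary FASTA file in which each record starts with a ">" header line.

```python
def fasta_lengths(path):
    """Yield (read_name, length) for each record in a FASTA file."""
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if name is not None:
                    yield name, length
                name, length = line[1:], 0
            elif line and name is not None:
                length += len(line)
    if name is not None:
        yield name, length


def summarize(path):
    """Return read count, total bases, and min/max/mean read length."""
    lengths = [length for _, length in fasta_lengths(path)]
    if not lengths:
        return {"reads": 0, "bases": 0, "min": 0, "max": 0, "mean": 0.0}
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "min": min(lengths),
        "max": max(lengths),
        "mean": total / len(lengths),
    }
```

Running summarize on X.ccs.fasta gives a first impression of the run. It does not replace the alignment against a reference genome described above.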

In our case, the ccs file is only 1% of the size of the raw fasta sequence files. Many more sequences are therefore left out of the ccs file, and to use them properly you need the SMRT software from PacBio. You will also find the PacBioToCA tools useful.
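The size comparison mentioned above is easy to reproduce for your own run. This sketch compares plain file sizes on disk, which is only a rough proxy for the number of bases; the function name size_fraction is made up for illustration.

```python
import os


def size_fraction(ccs_path, raw_path):
    """Return the size of the ccs file as a fraction of the raw FASTA file."""
    raw_size = os.path.getsize(raw_path)
    if raw_size == 0:
        raise ValueError("raw FASTA file is empty")
    return os.path.getsize(ccs_path) / raw_size
```

For the files listed earlier, size_fraction would be called with the paths to X.ccs.fasta and X.fasta; a result near 0.01 corresponds to the 1% figure seen in our run.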

Ideally, one would run a de novo assembly directly on PacBio data, but that is very hard because of the high error rate of the raw sequences. Researchers are therefore looking into two other possibilities:

i) Doing de novo assembly with Illumina or other short-read technologies, and using PacBio reads for scaffolding and extending the assembly,

ii) Running an error-correction routine on PacBio data using Illumina sequences, and then using the error-corrected reads for assembly. This is where PacBioToCA can help. For example, you can use Velvet to generate contigs from the Illumina data, run error correction on the PacBio reads, and then use the error-corrected PacBio reads to improve the Velvet assembly.