From the paper by C. S. Chin et al. in Nature Methods -
The principle (Fig. 1) and workflow (Fig. 2) of HGAP consist of several well-defined steps. (1) Select the longest sequencing reads as a seeding sequence data set. (2) Use each seeding sequence as a reference to recruit shorter reads, and preassemble reads through a consensus procedure. (3) Assemble the preassembled reads using an off-the-shelf assembler that can accept long reads. (4) Refine the assembly using all initial read data to generate the final consensus that represents the genome. Optionally, minimus2 or similar tools can be used to connect the contigs from step (3) to further improve the continuity of the assembly and remove spurious contigs due to assembly or sequencing errors.
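To make step (1) concrete, here is a minimal, self-contained Python sketch of selecting the longest subreads as the seeding set until a target coverage of the genome is reached. The function name, parameters, and toy values are ours for illustration only; they are not part of HGAP or SMRT Analysis.

    def select_seed_reads(read_lengths, genome_size, target_coverage=20):
        """Pick the longest reads whose combined length is about
        target_coverage x genome_size; these become the seeding reads."""
        budget = target_coverage * genome_size
        selected, total = [], 0
        # Walk reads from longest to shortest until the coverage budget is met.
        for idx in sorted(range(len(read_lengths)),
                          key=lambda i: read_lengths[i], reverse=True):
            if total >= budget:
                break
            selected.append(idx)
            total += read_lengths[idx]
        return selected

    # Toy usage: a deliberately tiny "genome" so a handful of reads suffices.
    lengths = [12000, 3500, 9800, 15000, 2200, 8700]
    seeds = select_seed_reads(lengths, genome_size=2000, target_coverage=20)
    print(seeds)  # indices of the four longest reads: [3, 0, 2, 5]

In practice the same idea is usually expressed as a length cutoff (keep every subread above a threshold), which is how the Methods section below describes it.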
Background

Previous approaches for pure PacBio de novo assembly, such as pacBioToCA and RS_Allora_Assembly_EC, require two libraries: typically one 10 kb library for continuous long reads (CLR) and one 1 kb library for circular consensus sequencing (CCS) reads. The CCS reads are aligned against the long reads to build a consensus and generate very long, highly accurate reads. In addition to requiring two library preps, the CLR plus CCS approach can require many SMRT® Cells to generate sufficient coverage of CCS data.
HGAP simplifies the sample preparation process by requiring only a single library prep, which halves the input DNA required. HGAP also reduces the number of SMRT Cells required by using single-pass reads rather than CCS reads. For example, assemblies using HGAP may use 1/4 to 1/2 as many SMRT Cells as assemblies using CCS. As a result, microbial genome assembly can be performed with less input DNA, with less sample preparation, at lower cost, and in less time. Larger genomes also become more feasible because the useful throughput increases.
Methods

HGAP consists of a series of well-defined steps:
1. Generate at least 60x coverage (ideally at least 100x) of your target genome using a long-insert library with 90- or 120-minute movies.
2. Choose a subread length threshold such that subreads above the threshold provide about 20x coverage of the genome.
3. Filter the base files using the subread length threshold. The file of long subreads constitutes the “seeding reads”.
4. Use BLASR to map all data against the seeding reads, using BLASR’s “-m 4” option (tabular M4 output). Set the “bestn” parameter to allow multiple mapped locations; a value in the range of 5 to 12 seems to work well for 20x seed coverage.
5. Process the mapped data, trimming reads to the boundaries of the seeding reads minus 100 bp (one reading of this step is sketched below, after the list).
6. Take a consensus using a partial-order alignment or another highly accurate method to generate the pre-assembled reads.
7. Feed the pre-assembled reads into an overlap-layout-consensus assembler that can handle very long reads.
8. Take the output of the assembler and treat it as a reference genome: map all raw data against it using BLASR.
9. Trim at the contig boundaries and take the consensus using Quiver to “polish” the assembly and remove errors.
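The trimming in step 5 (“boundaries of the seeding reads minus 100 bp”) can be read as clipping each mapped read so that its aligned interval stays at least 100 bp inside the ends of the seed read. The sketch below illustrates that reading on a single M4-style alignment record. The function name, the 100 bp margin default, the linear rescaling of query coordinates, and the toy values are our simplifications, not the actual HGAP implementation, which works from the per-base alignment.

    def clip_to_seed(qstart, qend, tstart, tend, tlen, margin=100):
        """Clip one query->seed alignment so the seed (target) interval lies
        at least `margin` bp inside the seed boundaries.

        Coordinates are 0-based, end-exclusive. Returns the clipped
        (qstart, qend, tstart, tend), or None if nothing survives.
        Query coordinates are rescaled linearly, which only approximates
        walking the per-base alignment."""
        lo, hi = margin, tlen - margin
        new_tstart, new_tend = max(tstart, lo), min(tend, hi)
        if new_tend <= new_tstart:
            return None
        # Cut the query interval in proportion to what was cut from the target.
        scale = (qend - qstart) / (tend - tstart)
        new_qstart = qstart + round((new_tstart - tstart) * scale)
        new_qend = qend - round((tend - new_tend) * scale)
        return new_qstart, new_qend, new_tstart, new_tend

    # Toy example: a read whose alignment starts 20 bp from the start of a
    # 12 kb seed read.
    print(clip_to_seed(qstart=0, qend=5000, tstart=20, tend=5020, tlen=12000))
    # -> (80, 5000, 100, 5020): the first 80 query bases are trimmed away.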
We will write more on the algorithm.