The next few SOAPdenovo-related commentaries will dig deeper into the data structures and algorithms of SOAPdenovo2. Here is a rough guide to what to expect.
A new user who prefers to stay oblivious to the details of how SOAPdenovo2 works has the following mental picture of the assembler -
Once you get somewhat familiar with the manual, you will find that the black box consists of four independent stages -
Why do we call them independent? The answer will be clear from the following picture -
Each stage of the program generates a set of output files, and the next stage reads them as input and performs the next step of the assembly. So, if you replace the output of a stage with files generated by a different program (say Velvet or Minia), written in the format the next stage expects, SOAPdenovo2 will not be able to tell the difference and will run to completion. That could be an interesting way to mix and match various assembly programs.
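The hand-off through files can be pictured with a toy pipeline. This is only a sketch of the idea, not SOAPdenovo2 code: the stage functions, the file names and the file contents below are all hypothetical and are far simpler than the real intermediate files.

```python
import tempfile
from pathlib import Path

# Hypothetical stage and file names, used only to illustrate the hand-off.
INTERMEDIATE = "toy.stage_a.txt"


def stage_a(workdir: Path, reads):
    """Toy first stage: write one record per line for the next stage."""
    out = workdir / INTERMEDIATE
    out.write_text("\n".join(reads) + "\n")
    return out


def stage_b(workdir: Path):
    """Toy second stage: read whatever is on disk, regardless of who wrote it."""
    records = (workdir / INTERMEDIATE).read_text().split()
    total_bases = sum(len(r) for r in records)
    return f"{len(records)} records, {total_bases} bases"


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        stage_a(workdir, ["ACGT", "GGCA"])
        native = stage_b(workdir)
        # Overwrite the intermediate file as if another program had produced it.
        (workdir / INTERMEDIATE).write_text("TTTT\nAAAA\nCC\n")
        swapped = stage_b(workdir)
        return native, swapped
```

The second stage only looks at the file on disk, so it processes the swapped-in file exactly as it processed the one written by the first stage.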
More adventurous bioinformaticians will probably look into the code to see what the program does to go from one stage to the next. You will find that the stages run through the following functions.
prlRead2HashTable - Parallel code for loading read libraries and splitting them into kmers
chopKmer4read - Chops the reads into kmers and stores them in KmerSets
thread_delow - Removes the kmers with low coverage
Mark1in1outNode - Marks the linear kmers
deLowCov - Removes the kmers with low coverage
removeSingleTips - Removes single tips
removeMinorTips - Removes minor tips
kmer2edges - Builds edges by combining linear kmers
prlRead2edge - Parallel code for remapping reads onto edges
getUnlikeArc - Gets arcs that could be processed incorrectly
deleteUnlikeArc - Deletes these arcs
PE2Links - Updates connections between contigs based on alignment information of paired-end reads
Links2Scaf - Constructs scaffolds based on alignment information
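To make the first few steps in the list concrete, here is a minimal Python sketch of chopping reads into kmers, dropping low-coverage kmers and marking the linear (one-in, one-out) kmers. It is not SOAPdenovo2's implementation: the function names are made up for illustration, everything is single-threaded and held in plain dictionaries, and reverse complements are ignored, which a real assembler cannot afford to do.

```python
from collections import Counter

BASES = "ACGT"


def chop_kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(chop_kmers(read, k))
    return counts


def drop_low_coverage(counts, min_cov):
    """Keep only kmers seen at least min_cov times."""
    return {kmer: c for kmer, c in counts.items() if c >= min_cov}


def successors(kmer, kmers):
    """Kmers in the set that overlap this kmer's last k-1 bases."""
    return [kmer[1:] + b for b in BASES if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """Kmers in the set that overlap this kmer's first k-1 bases."""
    return [b + kmer[:-1] for b in BASES if b + kmer[:-1] in kmers]


def linear_kmers(kmers):
    """Kmers with exactly one predecessor and exactly one successor."""
    return {
        kmer
        for kmer in kmers
        if len(predecessors(kmer, kmers)) == 1
        and len(successors(kmer, kmers)) == 1
    }
```

For example, with the reads `ACGTAG`, `ACGTAG` and `ACGTTT` and k = 3, the kmers `GTT` and `TTT` occur only once. With all kmers kept, `CGT` has two successors (`GTA` and `GTT`) and so is not linear; after dropping kmers seen fewer than twice, `CGT` and `GTA` are the linear kmers, which is what lets a later step merge them into a single edge.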
On our blog, we have posted many commentaries on building and resolving de Bruijn graphs, and readers can understand the first three stages by going through them. The scaffold stage is novel (and quite elaborate) in SOAPdenovo. It will take us some time to go through the algorithm, but curious readers may find the following diagram helpful. It explains the Links2Scaf function. Functions in red have many sub-functions not shown in the chart, whereas functions in black have few sub-functions.