Tutorials


Shotgun Assembly Approach

A typical eukaryotic chromosome is millions of nucleotides long, and no current sequencing technology can decode an entire chromosome in a single read. Genome sequencing therefore requires strategies beyond the sequencing instrument itself. In the shotgun approach, the chromosome is broken into many small fragments, and each fragment is decoded by the sequencing instrument. A specialized computer program, the genome assembler, then merges the resulting reads to computationally rebuild the genome sequence.
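
To make the shotgun idea concrete, here is a minimal Python sketch that samples short reads from random positions of a toy sequence. The function name `shotgun_reads`, the toy genome, and the read length are illustrative assumptions; real reads also contain errors and come from both strands, which this sketch ignores.

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at random start positions."""
    rng = random.Random(seed)            # fixed seed so the toy run is repeatable
    last_start = len(genome) - read_len  # last position where a full read still fits
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

# Hypothetical toy "chromosome"; a real one is millions of nucleotides long.
genome = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = shotgun_reads(genome, read_len=8, n_reads=20)
print(reads[:3])
```

The assembler only ever sees `reads`, not `genome`; its job is to recover the latter from the former.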

How do de Bruijn assemblers really assemble large genomes from short reads?

Step 1. A giant de Bruijn graph is constructed from all reads, but mate pair information is ignored at first. A de Bruijn graph records only which k-mers overlap; it does not preserve long-range positional relationships between sequences, so there is no natural way to incorporate mate pair data into the graph itself.
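
A minimal sketch of Step 1, assuming the common convention that nodes are (k-1)-mers and every k-mer in a read contributes one edge. The name `build_de_bruijn` and the toy reads are illustrative, and reverse complements are ignored for brevity. Each read is processed as an independent string, which is why pairing information plays no role here.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix.
    The stored integer counts how often that k-mer was seen (its coverage)."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

reads = ["ATGGCG", "GGCGTA", "CGTACC"]  # toy overlapping reads
graph = build_de_bruijn(reads, k=4)
for node, successors in graph.items():
    print(node, dict(successors))
```

The k-mer `GGCG` occurs in two reads, so the edge `GGC -> GCG` ends up with coverage 2.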

Step 2. The de Bruijn graph is cleaned of likely sequencing errors: hanging branches (tips) and loops (bubbles) supported by only a few reads are purged.
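
The sketch below illustrates the idea behind Step 2 with a plain coverage threshold: edges seen fewer than `min_cov` times are dropped. This is a deliberate simplification. Real assemblers use topology-aware tip clipping and bubble removal rather than a bare cutoff, and the function name, the threshold, and the toy graph are assumptions made for illustration.

```python
def prune_low_coverage(graph, min_cov):
    """Return a copy of the graph keeping only edges seen at least min_cov times.
    Nodes left without any outgoing edge are dropped from the result."""
    pruned = {}
    for node, successors in graph.items():
        kept = {nxt: cov for nxt, cov in successors.items() if cov >= min_cov}
        if kept:
            pruned[node] = kept
    return pruned

# Toy graph: node -> {successor: coverage}. The branch through CGA is seen once,
# which is what a hanging branch caused by a sequencing error tends to look like.
graph = {
    "ACG": {"CGT": 9, "CGA": 1},
    "CGT": {"GTT": 8},
    "CGA": {"GAT": 1},
}
cleaned = prune_low_coverage(graph, min_cov=2)
print(cleaned)  # {'ACG': {'CGT': 9}, 'CGT': {'GTT': 8}}
```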

Step 3. Some parts of the giant graph can be traversed linearly, and those linear parts are collapsed as far as possible. After this step, we may get a few contigs longer than 1 kb from non-repetitive regions of the genome. Other parts of the de Bruijn graph contain junctions (representing repetitive regions) and cannot be simplified. As a result, contigs never get very large at this stage for complex eukaryotic genomes.
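
Step 3 can be sketched as follows: start at every node that is not a simple pass-through (one edge in, one edge out) and extend along each outgoing edge until the next junction or dead end. The name `compact_contigs` and the toy edges are illustrative assumptions, and isolated cycles with no junction are ignored for brevity.

```python
from collections import defaultdict

def compact_contigs(edges):
    """edges: (prefix, suffix) pairs of (k-1)-mers, one pair per distinct k-mer.
    Returns the sequences spelled by maximal non-branching paths."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    nodes = set(succ) | set(indeg)

    def is_linear(n):
        # A pass-through node: exactly one way in and one way out.
        return indeg[n] == 1 and len(succ[n]) == 1

    contigs = []
    for start in nodes:
        if is_linear(start):
            continue  # paths only begin at junctions, sources, or sinks
        for nxt in succ[start]:
            seq = start + nxt[-1]
            while is_linear(nxt):
                nxt = succ[nxt][0]
                seq += nxt[-1]
            contigs.append(seq)
    return sorted(contigs)

# Edges for k=4 from the toy sequence ATGCGTGCA, in which TGC occurs twice.
edges = [("ATG", "TGC"), ("TGC", "GCG"), ("GCG", "CGT"),
         ("CGT", "GTG"), ("GTG", "TGC"), ("TGC", "GCA")]
contigs = compact_contigs(edges)
print(contigs)  # ['ATGC', 'TGCA', 'TGCGTGC']
```

The repeated node `TGC` is a junction, so the nine-base sequence comes out as three contigs instead of one. On a real genome the same effect keeps contigs short wherever repeats occur.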

Step 4. Mate pair information is introduced next. An attempt is made to find linear paths through the graph that satisfy the distance constraint imposed by each read pair on its two ends. A good example of how this is done is provided in the Velvet paper; see its section "Breadcrumb: Resolution of repeats with short read pairs".
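
The following is a generic illustration of the idea in Step 4, not Velvet's Breadcrumb algorithm. It assumes a read pair anchors in two contigs, `start` and `end`, and that the insert size implies roughly `expected_gap` bases of sequence between them. It then searches the contig graph for paths whose intermediate contigs add up to that distance. All names, lengths, and tolerances are hypothetical.

```python
def paths_matching_distance(succ, length, start, end, expected_gap, tolerance):
    """Depth-first search for paths start -> ... -> end whose intermediate
    contig lengths sum to within tolerance of expected_gap.
    Assumes every contig length is positive, so the search terminates."""
    matches = []

    def dfs(node, path, gap):
        if gap > expected_gap + tolerance:
            return  # already too long; this also stops endless looping on cycles
        for nxt in succ.get(node, []):
            if nxt == end:
                if abs(gap - expected_gap) <= tolerance:
                    matches.append(path + [end])
            else:
                dfs(nxt, path + [nxt], gap + length[nxt])

    dfs(start, [start], 0)
    return matches

# Toy contig graph with two alternative routes from A to B.
succ = {"A": ["X", "Y"], "X": ["B"], "Y": ["Z"], "Z": ["B"]}
length = {"A": 1000, "X": 300, "Y": 2000, "Z": 2500, "B": 1200}

# Pairs linking A and B suggest about 4500 bases between them.
paths = paths_matching_distance(succ, length, "A", "B",
                                expected_gap=4500, tolerance=500)
print(paths)  # [['A', 'Y', 'Z', 'B']]
```

Only the route through `Y` and `Z` is consistent with the pair distance, so the ambiguity at `A` is resolved in its favour.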

The contig/scaffold size after Step 4 depends on the repeat structure of the genome and on the mate pair insert sizes. Some less complex genomes can be assembled with 5 kb inserts, while more complex genomes may require 20 kb inserts. This is where the three red paragraphs come in handy. In our original description of the de Bruijn graph, we started by building the graph for the genome itself. I believe (although I have no calculation to prove it) that if the de Bruijn graph of a genome is simple and linear at a k-mer size of 5000 nt, one should be able to assemble it from 5 kb inserts. On the other hand, a complex genome with many 5 kb repeat regions will need larger inserts (say, 20 kb). Please do not take my word on this, and do your own due diligence (DYODD) when deciding what to do for a sequencing project.
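
The link between repeat length and k-mer size can be checked on a toy sequence. The sketch below builds the de Bruijn graph of the genome itself and reports whether any node branches. The function name and the sequence are illustrative assumptions, reverse complements are ignored, and it says nothing rigorous about insert sizes; it only shows that the graph becomes linear once k is large enough to span the longest repeat.

```python
from collections import defaultdict

def is_linear_at_k(genome, k):
    """True if the genome's de Bruijn graph (nodes = (k-1)-mers) has no node
    with more than one distinct successor or predecessor."""
    succ, pred = defaultdict(set), defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    return (all(len(s) <= 1 for s in succ.values())
            and all(len(p) <= 1 for p in pred.values()))

# Toy genome with one 7-base repeat (GATTACA) in two different contexts.
genome = "ACGTTGCA" + "GATTACA" + "CCTAGGTC" + "GATTACA" + "TTGACCGA"

min_k = next(k for k in range(2, len(genome)) if is_linear_at_k(genome, k))
print(min_k)  # 9
```

With a 7-base repeat, the graph first becomes linear at k = 9, where every node is an 8-mer and therefore longer than the repeat. At k = 8 the node `GATTACA` still has two predecessors and two successors.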