To understand the assembly process, you need to visualize an eukaryotic genome. The simplest picture consists of a linear chain of unique regions joined by repeat/low complexity regions. Unique regions appear only once in the genome. Repeat regions appear many times in the genome. We can add other complications like haplotype difference to the above simple picture later, but they are secondary for most genomes.
The assembly process has three steps -
i) Assemble all pieces of unique regions. These are called contigs. ii) Connect them up. The end product is called scaffold. iii) Fill in the gaps in between with repetitive sequences.
The first step is relatively easy with Sanger reads, but gets complicated with Illumina reads. It is generally accepted as the most difficult step in a short-read assembly process. One needs to be careful about an aggressive assembler that connects pieces which should not be connected. The possibility of making this kind of error goes up, when the reads are short.
The second step is equally challenging, whether you have Sanger or Illumina reads. Basically, you link up various short fragments using mate pairs, multi-exon genes, and anything else you can find.
The third step is often not completed, when a genome paper is published. For example, if you download rat or cow genomes published several years back, you will find that the genomes were filled with NNNNNN sequences. Those NNNNNN regions represent unassembled repeats. The assembler made an estimate about how big the repeat is, but could not manage to fill up the repeats.
Let me give you some numbers about how a genome looks like at the publication stage. I am most familiar with the sea urchin genome, which has roughly 900Mb of nucleotides. When it was published, the genome contained 114K pieces (scaffolds). Three years after publication, Baylor released an update, where they brought down the number of pieces to 32K. The pieces are not all of equal size, and N50 is the most accepted statistics to judge the quality of the assembly.
As it appears to me, Pac Bio data is not good for de novo assembly of unique regions, unless you have tons of sequences. Illumina seems to be more well-suited. However, PacBio can be of tremendous help in joining the unique regions.
If you compare Pac Bio (5Kb reads) with Illumina 5Kb mate pairs for the purpose of joining reads, one big advantage of Pac Bio is that it also gives you the intermediate sequences. So, not only we connect up two fragments, we also get what is in between. However, one big caution here is that the connection made with Pac Bio will be ‘noisy’, whereas the Illumina contigs being connected contain high-quality data. Therefor, one needs to make the Pac Bio reads go through error-correction process (typically using Illumina) to turn them into good sequences.
We are really amateurs, when it comes to Pacific Bio sequences. How amateur are we? A few weeks back, we commented on our first PacBio library and it appears on the first page of google search with keywords ‘Pacific bio sequences’. So we guess we are no less informed than the rest of the world
Nevertheless, here we like to report that our efforts of using PacBio sequences in a genome assembly has been successful. PacBio reads were used solely for scaffolding short contigs previously assembled from Illumina reads. Pacific Bio reads are long (2Kb average in our case) but very noisy, whereas Illumina reads are short and clean. Therefore, one can develop a strategy of using the best of each technology into a combined effort.
Here are few small comments on what worked and what did not to help you avoid few learning steps.
a) Alignment tools like BLAST are of less use for PacBio data, because the reads are very noisy. Therefore, the company makes another alignment program named BLASR available. After going through registration and few other steps, we downloaded BLASR, but we were unable to install it due to a missing library. Being impatient as we were, we decided to simply map all K-mers (derived from Illumina) on the PacBio reads using Bowtie and proceed with further analysis. That seemed to have worked and we did not bother with BLASR.
b) The second thing you need to remember in using PacBio reads is that they are circular, and you cannot use them as plain Sanger reads while connecting contigs.
A picture will make this point clear. Here we show a genomic region and a possible PacBio read from the same genomic region -
We are now doing a type of analysis with the PacBio library, where we plan to correct PacBio reads using Illumina to build clean reads. This step is used primarily to build the repetitive regions of the genome. The results will be reported in a follow up commentary.
From the paper of A. Bashir et al. in Nature Biotechnology -
To produce the hybrid assembly, we first generated a consensus CDC contig set. Given the clonal nature of the CDC isolates (Supplementary Results), we split contigs from the minimal CDC assembly that were inconsistent with the remaining two isolates. If the split resulted in a subcontig of <1 kb in length, the subcontig was eliminated. We input the resulting 97 contigs in this set, along with 94,526 single-molecule reads from the PacBio RS with an average accuracy of 82.9% (Supplementary Fig. 2), into our hybrid assembly pipeline (Supplementary Fig. 3). The pipeline starts with five inputs: the CDC contigs, the raw 454 reads, the raw Illumina reads and the long and strobe PacBio reads. We used the PacBio long reads and strobe reads to scaffold the CDC contigs with the A Hybrid Assembler (AHA) scaffolding algorithm (Supplementary Methods). We also corrected the PacBio long reads with the 454 and Illumina reads using a previously described method2, and we then used these corrected reads to either correct errors or fill gaps in the AHA scaffolds at various points in the pipeline. We then mapped the PacBio long reads back to the gap-filled scaffolds overturning low-quality base calls, thus ensuring that the final sequence came from a single clonal source. The computational performance of the assembly pipeline is given in Supplementary Table 1.
We will write more on the algorithm.
Chen Shan: I am not sure it is right thing to do some coding in a vacation. (See this tweet, not sure if I would agree.) Anyway, before the vacation, I decided to organize whole bunch of random papers I collected in the last few years in my laptop. I eventually felt that I should read some of some theoretical papers about genome assembly again. I grabbed Gene Meyer’s paper, and started to think about the problem about constructing unitigs (high-confident contigs). Before I tried to read the paper in detail, I just felt maybe that it was useful to write some quick code to check out what real data looked like. I started writing some code visualizing the local overlapping and generating the global overlapping graph. It was pretty straight forward and quite inspiring from the visualization. The visualization motived me to write more ad-hoc code in IPython to go beyond generating simple visualization during the vacation. I started to implement a very simple greedy algorithm to connect the input DNA fragments. Eventually, I found I can atucally get the whole genome assembly right!! This Notebook shows the work step by step toward a simple genome assembler for long read data using IPython and Python.