
Assembling Color Space Reads

Researchers interested in assembling color-space reads face two challenges. First, most tools are written for nucleotide-space data and cannot be readily applied to color-space libraries. Second, even though color-space libraries contain a larger amount of data, their per-read error rate tends to be higher than that of nucleotide-space libraries. Those errors distort the structure of the de Bruijn graph.

Is Conversion to Nucleotide Space Prudent?

If the assembly program does not work in color-space, should the researcher convert all reads to nucleotide-space? The answer is no. Each decoded base depends on the previous base and the current color, so directly converting a color-space read that contains a single miscalled color introduces errors in the converted read from the position of that error all the way to its 3’ end. No such error propagation occurs when nucleotide-space reads are converted to color-space. Therefore, even if the researcher is doing a hybrid assembly with a mixture of SOLiD and Illumina libraries, all reads should be converted to color-space prior to assembly.
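The asymmetry between the two conversion directions can be checked with a small sketch. It assumes the standard two-base encoding in which A, C, G, T are numbered 0 to 3 and each color is the XOR of two adjacent base codes; the function names and the convention of keeping the first base as a prefix are illustrative choices, not part of any particular tool.

```python
# Minimal sketch of two-base (color-space) encoding and decoding.
# Assumption: A=0, C=1, G=2, T=3 and color = XOR of adjacent base codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(bases):
    """Convert nucleotides to a first base followed by color digits."""
    codes = [BASE_TO_CODE[b] for b in bases]
    colors = [str(x ^ y) for x, y in zip(codes, codes[1:])]
    return bases[0] + "".join(colors)


def decode_colors(read):
    """Convert a first base plus color digits back to nucleotides."""
    current = BASE_TO_CODE[read[0]]
    decoded = [read[0]]
    for color in read[1:]:
        # Each base is derived from the previous base and the current color,
        # so one wrong color shifts every base that follows it.
        current ^= int(color)
        decoded.append(CODE_TO_BASE[current])
    return "".join(decoded)


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)
```

With these definitions, `encode_colors("ATGGCATTCA")` gives `A310313021`. Changing the third color from 0 to 1 and decoding `A311313021` yields `ATGTACGGAC`, which differs from the original at all seven bases from the error onward. In the other direction, a single wrong base (`ATGTCATTCA`) changes only the two colors that flank it.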

Pseudo basespace

Most programs are written for nucleotide-space data, so converting all reads to color-space does not by itself solve the lack of software. However, there is a simple solution - pseudo basespace. Suppose a researcher has a color-space sequence that they would like to align with a reference using CLUSTAL. One way to get around the tools problem is to replace every color in the color-space sequences with a nucleotide letter (0=A, 1=C, 2=G, 3=T). This is known as pseudo basespace: the reads are intrinsically in color-space, but the programs treat them as nucleotide-space data.
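The substitution described above is a plain character translation. The following sketch assumes the input is a string of color digits only (any leading primer base has already been removed); the function names are hypothetical.

```python
# Minimal sketch of the pseudo basespace mapping 0=A, 1=C, 2=G, 3=T.
# Assumption: the input holds only the digits 0-3, with no primer base.
COLOR_TO_PSEUDO = str.maketrans("0123", "ACGT")
PSEUDO_TO_COLOR = str.maketrans("ACGT", "0123")


def to_pseudo_basespace(colors):
    """Rewrite color digits as nucleotide letters."""
    if set(colors) - set("0123"):
        raise ValueError("expected only the color digits 0, 1, 2 and 3")
    return colors.translate(COLOR_TO_PSEUDO)


def from_pseudo_basespace(seq):
    """Recover the color digits from a pseudo basespace string."""
    if set(seq) - set("ACGT"):
        raise ValueError("expected only the letters A, C, G and T")
    return seq.translate(PSEUDO_TO_COLOR)
```

For example, the color string `10320` becomes `CATGA`, and the reverse mapping restores `10320`. The letters carry no nucleotide meaning; they only make the data acceptable to tools that expect A, C, G and T.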

This easy solution creates one difficulty, however. Many tools analyze both strands, and to do that they derive reverse complements of various sequences. Because the reverse complementing rules differ between nucleotide-space and color-space data, this issue can only be handled by modifying the source code of the bioinformatics program. Such a modification is relatively easy, because most assemblers and other programs have a separate ‘reverse complement’ function, and modifying that function is enough to make the code suitable for color-space data in pseudo-basespace format.
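As a rough illustration of what such a modification looks like, the sketch below contrasts an ordinary reverse complement with a pseudo-basespace version. It assumes the XOR-based two-base encoding (A=0, C=1, G=2, T=3), under which complementing both bases of a pair leaves the color unchanged, so the opposite strand is obtained by reversal alone. All function names are hypothetical and do not come from any specific assembler.

```python
# Sketch: reverse complement in nucleotide space versus pseudo basespace.
# Assumption: colors are the XOR of adjacent base codes (A=0, C=1, G=2, T=3).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_nucleotide(seq):
    """Ordinary reverse complement, valid only for real nucleotides."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_pseudo(seq):
    """Replacement for pseudo-basespace data: reverse without complementing."""
    return seq[::-1]


def pseudo_colors(bases):
    """Encode nucleotides as colors written in pseudo basespace letters."""
    codes = [BASE_TO_CODE[b] for b in bases]
    return "".join("ACGT"[x ^ y] for x, y in zip(codes, codes[1:]))
```

The nucleotide read `ATGGC` encodes to `TCAT` in pseudo basespace, and its true reverse complement `GCCAT` encodes to `TACT`, which is exactly `TCAT` reversed. Applying the unmodified nucleotide function to `TCAT` would give `ATGA` instead, which corresponds to no strand of the original molecule.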

Error Correction

SOLiD reads contain a large number of color-space errors. Because of this volume of errors, the reads need to be corrected before they are assembled. Error correction is usually done by counting the k-mers in an entire library and replacing low-frequency k-mers with closely related high-frequency k-mers.
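A toy version of this k-mer counting approach is sketched below. Several details are assumptions made for illustration only: "closely related" is taken to mean a single-symbol substitution, the `min_count` threshold is an arbitrary parameter, and k-mers are counted on one strand. Production error correctors are considerably more elaborate.

```python
# Toy sketch of k-mer spectrum error correction on color-space reads.
# Assumptions: "closely related" = one substitution; min_count is arbitrary.
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the library."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer, counts, min_count, alphabet="0123"):
    """Replace a rare k-mer with its most frequent one-substitution neighbor."""
    if counts[kmer] >= min_count:
        return kmer  # frequent enough to be trusted as is
    best, best_count = kmer, counts[kmer]
    for i in range(len(kmer)):
        for symbol in alphabet:
            if symbol == kmer[i]:
                continue
            candidate = kmer[:i] + symbol + kmer[i + 1:]
            if counts[candidate] >= min_count and counts[candidate] > best_count:
                best, best_count = candidate, counts[candidate]
    return best  # unchanged if no trusted neighbor exists
```

For a library of five copies of `0123012` and one erroneous read `0123312`, counting with k=4 leaves `1233` with a count of 1, and `correct_kmer("1233", counts, 3)` replaces it with the frequent neighbor `1230`.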

de Bruijn Graph of SOLiD Reads

To get a mental picture of what the de Bruijn graph of a SOLiD library looks like, readers can follow the same procedure as in section 2, keeping one difference in mind: in color-space, the reverse complement of a k-mer is simply the forward k-mer read in the opposite direction, with no complementing of individual symbols. Apart from that, all of our previous discussions remain valid for color-space reads.
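One way to picture this is to build a small graph in which each k-mer and its reversal share a single node. The sketch below is a simplification under stated assumptions: nodes are canonical k-mers (the lexicographically smaller of a k-mer and its reverse), edges are stored as undirected adjacency, and the function names are hypothetical. It is not the construction used by any particular assembler.

```python
# Simplified sketch of a de Bruijn-style graph over color-space k-mers.
# Assumption: a k-mer and its reversal represent the same node, because the
# opposite strand of a color-space sequence is its reversal.
from collections import defaultdict


def canonical(kmer):
    """Pick one representative for a k-mer and its opposite strand."""
    return min(kmer, kmer[::-1])


def build_graph(reads, k):
    """Link consecutive canonical k-mers of every read (undirected)."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [canonical(read[i:i + k]) for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
            graph[right].add(left)
    return graph
```

Under this scheme, the read `3103120` and the same molecule sequenced from the other strand, `0213013`, produce an identical graph, which is the behavior an assembler needs when reads come from both strands.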

