Applied Biosystems (now Life Technologies) commercialized a novel sequencing technology that reports reads in ‘color space’ instead of as nucleotides (‘A’, ‘T’, ‘C’, ‘G’). The SOLiD machine reports transitions between neighboring nucleotide pairs. When one looks at pairs of neighboring nucleotides, the number of reportable combinations increases from 4 (A, C, G, T) to 16 (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT). To simplify reporting, the SOLiD machines elegantly reduce the possible combinations from 16 to 4 based on the following table:
0: AA, CC, GG, TT
1: AC, CA, GT, TG
2: AG, GA, CT, TC
3: AT, TA, CG, GC
You see that four combinations (AA, CC, GG, TT) are all reported as 0. Another four combinations (AC, CA, GT, TG) are reported as 1, and so on. We will explain later why these particular numbers were chosen for the table, and we will also show that this reduction of complexity comes at a cost. First, let us explain how the color space works.
How to convert sequences to color space?
Let us choose a specific example (ATGGTGGTTGTTA). The sequence is converted one neighboring pair at a time:
AT=3, TG=1, GG=0, GT=1, TG=1, GG=0, GT=1, TT=0, TG=1, GT=1, TT=0, TA=3
In the color code table shown earlier, the first transition, A-T, is noted as red or 3. The second transition, T-G, is noted as green or 1. Continuing in the same manner, the entire sequence ATGGTGGTTGTTA is converted to 310110101103. That is the color space representation of the sequence. Note that 13 nucleotides yield 12 colors, one per neighboring pair.
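The pair-by-pair conversion described above can be sketched in a few lines of Python. The 2-bit codes and the XOR trick are an implementation choice of this sketch, not something the article prescribes; they simply reproduce the color code table (for example A=0 and T=3 give 0 XOR 3 = 3). The function name `to_color_space` is hypothetical.

```python
# Assumed 2-bit codes: XOR of two neighboring codes reproduces the color table.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(seq):
    """Return one color digit (0-3) for each neighboring pair of bases."""
    codes = [BASE_CODE[base] for base in seq.upper()]
    return "".join(str(left ^ right) for left, right in zip(codes, codes[1:]))


print(to_color_space("ATGGTGGTTGTTA"))  # 310110101103
```

This version raises a `KeyError` on ambiguous bases such as N; how to handle them is left open here.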
If we choose another example, GCAACAACCACCG, we soon discover that it also converts to 310110101103. Two other sequences, CGTTGTTGGTGGC and TACCACCAACAAT, have the same representation. Do you see the problem here? The nucleotide space sequence of a color space representation is not unique. Every color space string can be converted to nucleotide space in exactly four different ways, one for each choice of first nucleotide. This is a consequence of reducing the 16 pair combinations to 4 numbers in the color code table.
To avoid ambiguity, SOLiD machines report the first nucleotide along with the color space representation. Therefore, the four sequences above are reported as A310110101103, C310110101103, G310110101103 and T310110101103. Once the first nucleotide is known, color space data can be converted to nucleotide space in a unique manner. However, we will soon see that this conversion back to nucleotide space is inefficient and eliminates the primary advantage of SOLiD sequencing. Analysis of color space data needs to be done in color space.
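As an illustration of the decoding step, here is a minimal Python sketch that walks along the colors starting from the reported first nucleotide. It assumes the same XOR-based encoding of the color table (A=0, C=1, G=2, T=3), and the function name `to_nucleotide_space` is hypothetical. Running it with each of the four possible first bases shows the four-fold ambiguity.

```python
BASES = "ACGT"  # index in this string is the assumed 2-bit code of each base


def to_nucleotide_space(read):
    """Decode a read given as first base + color digits, e.g. 'A3101'."""
    first, colors = read[0].upper(), read[1:]
    code = BASES.index(first)
    decoded = [first]
    for color in colors:
        code ^= int(color)          # each color tells how to step to the next base
        decoded.append(BASES[code])
    return "".join(decoded)


for first in BASES:
    print(first, to_nucleotide_space(first + "310110101103"))
# A ATGGTGGTTGTTA
# C CGTTGTTGGTGGC
# G GCAACAACCACCG
# T TACCACCAACAAT
```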
How do we compute reverse complement in color space?
The reverse complement of the original sequence (ATGGTGGTTGTTA) is (TAACAACCACCAT). Converted into color space, it becomes 301101011013.
You can see that this color space representation is exactly the original color space data (310110101103) read backwards. This is true for all sequences: in color space, taking the reverse complement only requires reversing the string. That is how the numbers in the conversion table were chosen.
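The relationship between reverse complement and color space can be checked directly. The sketch below assumes the XOR-based encoding of the color table used for illustration (A=0, C=1, G=2, T=3); both helper names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_color_space(seq):
    """Return one color digit for each neighboring pair of bases."""
    codes = [BASE_CODE[base] for base in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def reverse_complement(seq):
    """Ordinary nucleotide-space reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "ATGGTGGTTGTTA"
rc = reverse_complement(seq)
print(rc)                          # TAACAACCACCAT
print(to_color_space(rc))          # 301101011013
print(to_color_space(seq)[::-1])   # 301101011013, the same string
```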
Simple sequences
Low complexity sequences also show two or four-fold degeneracy in color space. For example,
SNP
If we introduce a single-nucleotide SNP in the original sequence, its color space representation changes at two adjacent positions, because the altered base takes part in two neighboring pairs. You can see that by comparing the following example with the original sequence. The modified region is marked in red.
A single change in color space, on the other hand, can dramatically change the nucleotide space representation. For example, the sequence A310110201103, which differs by one color from the original A310110101103, translates back to ATGGTGGAACAAT. Do you see how different it is from the original ATGGTGGTTGTTA? Every base from the changed position to the end of the read is different.
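Both effects can be reproduced with a short Python sketch. The SNP used here (T to A at the eighth base) is a hypothetical example chosen for this illustration, and the encoding again assumes the XOR form of the color table (A=0, C=1, G=2, T=3). The helper names are not from the article.

```python
BASES = "ACGT"  # index is the assumed 2-bit code of each base


def to_color_space(seq):
    codes = [BASES.index(base) for base in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def to_nucleotide_space(read):
    code = BASES.index(read[0])
    decoded = [read[0]]
    for color in read[1:]:
        code ^= int(color)
        decoded.append(BASES[code])
    return "".join(decoded)


def diff_positions(a, b):
    """0-based positions where two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


original = "ATGGTGGTTGTTA"
snp = "ATGGTGGATGTTA"  # hypothetical SNP: eighth base changed from T to A

# One changed base gives two adjacent changed colors.
print(to_color_space(snp))                                            # 310110231103
print(diff_positions(to_color_space(original), to_color_space(snp)))  # [6, 7]

# One changed color corrupts every decoded base downstream of it.
decoded = to_nucleotide_space("A310110201103")
print(decoded)                            # ATGGTGGAACAAT
print(diff_positions(original, decoded))  # [7, 8, 9, 10, 11, 12]
```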
We provide two Perl scripts to convert any other sequences from color space to nucleotide space or from nucleotide space to color space.
Advantage of Color Space
The last example shows the greatest advantage of color space sequencing. Sequencing machines are error prone, and among all errors, single nucleotide changes are the most common. This is a problem for all resequencing projects, because their primary goal is to identify single nucleotide differences from a reference sequence. Often it is not clear whether a single nucleotide change observed after sequencing is due to sequencing error or is a genuine difference from the reference. With color space sequencing, true SNPs can be easily distinguished from sequencing errors. A real SNP shows up as two adjacent color changes relative to the reference genome, whereas a sequencing error shows up as a single isolated color change and, if decoded, does not translate to anything close to the reference genome.
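A rough Python sketch of this distinction follows. It compares a read with the reference in color space and labels each mismatch. The consistency test used here is an assumption of the sketch rather than something stated in the article: with the XOR form of the color table, the two colors that flank a single changed base must XOR to the same value in the read and in the reference, because the bases on either side are unchanged. The function name and labels are hypothetical, and adjacent multi-base variants are not handled.

```python
def classify_color_mismatches(ref_colors, read_colors):
    """Label color mismatches as 'candidate_snp' or 'likely_error'.

    Returns a list of (label, 0-based position of the first changed color).
    """
    events = []
    n = min(len(ref_colors), len(read_colors))
    i = 0
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        j = i + 1
        adjacent_change = j < n and ref_colors[j] != read_colors[j]
        if adjacent_change and (
            int(ref_colors[i]) ^ int(ref_colors[j])
            == int(read_colors[i]) ^ int(read_colors[j])
        ):
            events.append(("candidate_snp", i))  # two consistent adjacent changes
            i += 2
        else:
            events.append(("likely_error", i))   # isolated or inconsistent change
            i += 1
    return events


reference = "310110101103"
print(classify_color_mismatches(reference, "310110231103"))  # [('candidate_snp', 6)]
print(classify_color_mismatches(reference, "310110201103"))  # [('likely_error', 6)]
```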
We note that color space data from SOLiD machines often contain single color errors. If one converts all sequences to nucleotide space, the converted data do not look remotely close to the real sequence, and all subsequent analysis may therefore become highly error prone. It is thus advisable to perform as much computational analysis in color space as possible. It is better to convert the reference data to color space than to convert SOLiD sequences to nucleotide space.
Disadvantage
The above constraint becomes the primary disadvantage of color space data. Over the years, many analysis tools have been written for nucleotide space sequences. They often do not work in color space. Moreover, at the end of the day the data have to be converted to nucleotide space, and that may not always be possible. For example, one can perform de novo assembly of an unsequenced region in color space, but how does one then recover the real sequence of that region?
Is Conversion to Nucleotide Space Prudent?
If the assembly program does not work in color-space, should the researcher convert all reads to nucleotide-space? The answer is no, because direct conversion of a color-space read containing a single color error to nucleotide-space introduces errors in the converted read from the point of the error all the way to its 3’ end. No such error is introduced when nucleotide-space reads are converted to color-space. Therefore, even if the researcher is doing hybrid assembly with a mixture of SOLiD and Illumina libraries, all reads should be converted to color-space prior to assembly.
Pseudo basespace
Most programs are written for nucleotide space data, so converting all reads to color-space does not solve the lack of software. However, there is a simple solution: pseudo basespace. Let us say a researcher has a color space sequence that they would like to align with a reference using CLUSTAL. One way to get around the tools problem is to recode the digits of every color-space sequence as letters (0=A, 1=C, 2=G, 3=T). This is known as pseudo basespace, because the reads are intrinsically in color-space but make the programs believe that they are nucleotide-space data.
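The recoding itself is a plain character substitution. A minimal Python sketch, using the 0=A, 1=C, 2=G, 3=T mapping given above, could look like this; the function names are hypothetical, and the sketch assumes the leading nucleotide that SOLiD reports has already been stripped or is handled separately.

```python
TO_PSEUDO = str.maketrans("0123", "ACGT")
FROM_PSEUDO = str.maketrans("ACGT", "0123")


def to_pseudo_basespace(colors):
    """Recode color digits as letters so nucleotide-space tools accept them."""
    return colors.translate(TO_PSEUDO)


def from_pseudo_basespace(pseudo):
    """Recover the color digits from a pseudo basespace string."""
    return pseudo.translate(FROM_PSEUDO)


pseudo = to_pseudo_basespace("310110101103")
print(pseudo)                         # TCACCACACCAT
print(from_pseudo_basespace(pseudo))  # 310110101103
```

The letters in the output are not real nucleotides; they are only labels for colors.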
This easy solution creates one difficulty, however. Many tools do their analysis on both strands, and to do that they derive reverse complements of various sequences. Because the reverse complementing rules differ between nucleotide-space and color-space data (in color space the reverse complement is simply the reversed string, with no complementing), this issue can only be handled by hacking the code of the bioinformatics program. Such hacking is relatively easy, because most assemblers and other programs have a separate ‘reverse complement’ function, and modifying that function is enough to make the code suitable for color-space data in pseudo-basespace format.
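To make the required change concrete, here is a hedged Python sketch contrasting a typical nucleotide-space reverse complement with a replacement suitable for pseudo basespace. Neither function is taken from any particular assembler; they only illustrate the kind of one-function patch described above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_nucleotide(seq):
    """What an unmodified tool typically does: complement, then reverse."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_pseudo(seq):
    """Patched version for pseudo basespace: reverse only, no complement."""
    return seq[::-1]


# Pseudo basespace form of 310110101103 (colors of ATGGTGGTTGTTA).
pseudo = "TCACCACACCAT"

print(reverse_complement_pseudo(pseudo))      # TACCACACCACT, i.e. colors 301101011013
print(reverse_complement_nucleotide(pseudo))  # ATGGTGTGGTGA, not a valid other-strand read
```

The first output corresponds to the color string of the true reverse complement (TAACAACCACCAT), while the unpatched function produces a string that matches neither strand.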
Error Correction
SOLiD reads contain a large number of color-space errors. Because of this volume of errors, the reads need to be corrected before assembly. Error correction is usually done by counting the k-mers in an entire library and replacing low-frequency k-mers with closely related high-frequency k-mers.
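The k-mer idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the method of any specific error corrector: "closely related" is taken to mean one substitution away, the threshold `min_count` and the value of k are arbitrary, and the function names are hypothetical. Reads are treated as plain color strings.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads in the library."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer, counts, min_count, alphabet="0123"):
    """Replace a rare k-mer with its most frequent one-substitution neighbor."""
    if counts[kmer] >= min_count:
        return kmer  # frequent enough to be trusted as-is
    best, best_count = kmer, 0
    for i, current in enumerate(kmer):
        for alt in alphabet:
            if alt == current:
                continue
            candidate = kmer[:i] + alt + kmer[i + 1:]
            if counts[candidate] >= min_count and counts[candidate] > best_count:
                best, best_count = candidate, counts[candidate]
    return best  # unchanged if no trusted neighbor exists


# Toy library: five correct copies and one read with a single color error.
reads = ["310110101103"] * 5 + ["310110201103"]
counts = count_kmers(reads, k=6)
print(correct_kmer("110201", counts, min_count=3))  # 110101
print(correct_kmer("110101", counts, min_count=3))  # 110101 (already trusted)
```

Real libraries need a threshold chosen from the observed k-mer frequency distribution, and a full corrector would also stitch corrected k-mers back into the reads.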