Applied Biosystems (now Life Technologies) commercialized a novel sequencing technology that reports reads in 'color space' instead of as the nucleotides 'A', 'T', 'C' and 'G'. The SOLiD machine reports transitions between neighboring nucleotide pairs. When one looks at pairs of neighboring nucleotides, the number of reportable combinations increases from 4 (A, C, G, T) to 16 (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT). To simplify reporting, the SOLiD machines elegantly reduce the possible combinations from 16 to 4 based on the following table:
You see that four combinations (AA, CC, GG, TT) are all reported as 0. Another four combinations (AC, CA, GT, TG) are reported as 1, and so on. We shall explain the reasoning behind the numbers in the table, and we shall also show that this reduction of complexity comes at a cost. First, let us explain how color space works.
Color 0: AA, CC, GG, TT
Color 1: AC, CA, GT, TG
Color 2: AG, GA, CT, TC
Color 3: AT, TA, CG, GC
How are sequences converted to color space?
Let us choose a specific example, ATGGTGGTTGTTA. Each pair of neighboring nucleotides in the sequence is replaced by its color code, as follows.
In the color code table shown earlier, the first transition, from A to T, is noted as red, or 3. The second transition, from T to G, is noted as green, or 1. Continuing in the same manner, the entire sequence ATGGTGGTTGTTA is converted to 310110101103. That is the color space representation of the sequence. Note that a sequence of 13 nucleotides yields 12 colors, one per neighboring pair.
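The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the Perl scripts mentioned later in this article. The names `BASE_CODE` and `to_color_space` are our own, and the sketch relies on the observation that, if the bases are numbered A=0, C=1, G=2, T=3, the color of a pair equals the XOR of the two numbers.

```python
# Assumed numbering of the bases; with it, the color of a pair is the XOR
# of the two numbers (e.g. A^T = 0^3 = 3, G^T = 2^3 = 1, A^A = 0).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(seq):
    """Return the color string for an uppercase A/C/G/T sequence."""
    codes = [BASE_CODE[base] for base in seq]
    # One color per pair of neighboring nucleotides.
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


print(to_color_space("ATGGTGGTTGTTA"))  # 310110101103
```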
If we choose another example, GCAACAACCACCG, we soon discover that it also converts to 310110101103. Two other sequences, CGTTGTTGGTGGC and TACCACCAACAAT, have the same representation. Do you see the problem here? The nucleotide space sequence of a color space representation is not unique. Every color space string can be converted to nucleotide space in exactly four different ways, one for each choice of first nucleotide. This is a consequence of reducing the 16 combinations to 4 when choosing the numbers in the color code table.
To avoid ambiguity, SOLiD machines report the first nucleotide along with the color space representation. Therefore, the four sequences will be given as A310110101103, C310110101103, G310110101103 and T310110101103. Once the first nucleotide is known, color space data can be converted to nucleotide space in a unique manner. However, we will soon see that this conversion back to nucleotide space is inefficient and eliminates the primary advantage of SOLiD sequencing. Analysis of color space data needs to be done in color space.
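Decoding with a known first nucleotide can be sketched as follows. Again this is a minimal illustration with names of our own choosing (`to_nucleotides`, `CODE_BASE`), using the same assumed numbering A=0, C=1, G=2, T=3. Because XOR is its own inverse, the next base is the previous base XOR the color.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"  # index -> base, the inverse of BASE_CODE


def to_nucleotides(read):
    """Decode a read such as 'A310110101103' (first base, then colors)."""
    bases = [read[0]]
    for color in read[1:]:
        # next base = previous base XOR color
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


# The same colors decode to four different sequences, one per first base.
for first in "ACGT":
    print(first, to_nucleotides(first + "310110101103"))
# A ATGGTGGTTGTTA
# C CGTTGTTGGTGGC
# G GCAACAACCACCG
# T TACCACCAACAAT
```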
How do we compute reverse complement in color space?
The reverse complement of the original sequence (ATGGTGGTTGTTA) is TAACAACCACCAT. Converting it into color space gives 301101011013.
You can see that this color space representation is the exact reverse of the original color space data (310110101103). This is true for all sequences, because complementing both nucleotides of a pair leaves its color unchanged (for example, AC and TG are both 1). That is how the numbers in the conversion table were chosen.
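The property can be checked with a short sketch. The helper names are our own, and the numbering A=0, C=1, G=2, T=3 is an assumption made for convenience: under it, complementing a base is the same as XOR with 3, so the XOR of a pair (its color) does not change, and only the order of the colors is reversed.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_color_space(seq):
    """Color of each neighboring pair is the XOR of the two base codes."""
    codes = [BASE_CODE[base] for base in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "ATGGTGGTTGTTA"
colors = to_color_space(seq)
rc_colors = to_color_space(reverse_complement(seq))
print(colors)                     # 310110101103
print(rc_colors)                  # 301101011013
print(rc_colors == colors[::-1])  # True
```

In other words, the reverse complement in color space is obtained by simply reversing the color string, without going through nucleotide space.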
Low complexity sequences also show two or four-fold degeneracy in color space. For example,
If we introduce a single nucleotide change (SNP) into the original sequence, its color space representation changes at two adjacent positions, because the changed nucleotide belongs to two neighboring pairs. You can see that by comparing the following example with the original sequence. The modified region is marked in red.
A single color change, on the other hand, can dramatically change the nucleotide space representation. For example, the sequence A310110201103, with one color space difference from the original A310110101103, translates back to ATGGTGGAACAAT. Do you see how different it is from the original ATGGTGGTTGTTA? Every nucleotide after the changed color is different.
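The effect of a SNP on the color string can be reproduced with a small sketch. The substitution used here (T to A at 0-based position 7) is a hypothetical one chosen only for illustration, and the helper names are our own.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(seq):
    """Color of each neighboring pair is the XOR of the two base codes."""
    codes = [BASE_CODE[base] for base in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def diff_positions(a, b):
    """0-based positions at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "ATGGTGGTTGTTA"
snp_read = "ATGGTGGATGTTA"  # hypothetical SNP: T -> A at position 7

ref_colors = to_color_space(reference)
snp_colors = to_color_space(snp_read)
print(ref_colors)                              # 310110101103
print(snp_colors)                              # 310110231103
print(diff_positions(ref_colors, snp_colors))  # [6, 7]
```

The changed nucleotide takes part in the pairs at color positions 6 and 7, so exactly those two colors change. A change in the first or last nucleotide of a sequence would affect only one color, since that nucleotide belongs to a single pair.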
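To see how far a single color change propagates, we can decode both reads and list the positions where the nucleotides disagree. This is a sketch with helper names of our own, using the assumed numbering A=0, C=1, G=2, T=3.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"  # index -> base


def to_nucleotides(read):
    """Decode a read given as first base followed by colors."""
    bases = [read[0]]
    for color in read[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


original = to_nucleotides("A310110101103")
one_color_changed = to_nucleotides("A310110201103")
mismatches = [i for i, (x, y) in enumerate(zip(original, one_color_changed))
              if x != y]
print(original)           # ATGGTGGTTGTTA
print(one_color_changed)  # ATGGTGGAACAAT
print(mismatches)         # [7, 8, 9, 10, 11, 12]
```

One changed color out of twelve turns into six mismatching nucleotides, because every base downstream of the change is decoded from a wrong predecessor.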
We provide two Perl scripts to convert any other sequence from color space to nucleotide space or from nucleotide space to color space.
The last example shows the greatest advantage of color space sequencing. Sequencing machines are error prone, and among all errors, single nucleotide changes are the most common. This is a problem for all resequencing projects, because their primary goal is to identify single nucleotide changes from a reference sequence. Often it is not clear whether a single nucleotide change observed after sequencing is due to sequencing error or is a genuine difference from the reference. When one works with color space sequencing, true SNPs can be easily distinguished from sequencing errors. A real SNP produces two adjacent color changes relative to the reference genome, whereas a sequencing error produces a single color change which, once decoded, does not translate to anything close to the reference genome.
We note that color space data from SOLiD machines often contain single color errors. If one converts all reads to nucleotide space, the converted data look nothing like the real sequence downstream of each error, and all subsequent analysis may become highly error prone. Therefore, it is advisable to perform as much computational analysis in color space as possible. It is better to convert the reference data to color space than to convert SOLiD reads to nucleotide space.
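The distinction can be sketched as a simple rule on color mismatches between a read and the reference, both already in color space and of equal length. This is an illustrative sketch only, not how any particular SOLiD analysis tool works: the function name and labels are hypothetical, and it ignores quality values, read ends, adjacent errors and variants longer than one base. It uses one extra fact that follows from the XOR view of the color table: a single nucleotide substitution changes two adjacent colors but leaves their XOR unchanged.

```python
def classify_color_mismatches(ref_colors, read_colors):
    """Label each color mismatch as 'snp_candidate' or 'likely_error'.

    Returns a list of (position, label) tuples; for an SNP candidate the
    position is that of the first of the two changed colors.
    """
    ref = [int(c) for c in ref_colors]
    read = [int(c) for c in read_colors]
    diffs = [i for i in range(len(ref)) if ref[i] != read[i]]

    calls = []
    i = 0
    while i < len(diffs):
        pos = diffs[i]
        adjacent = i + 1 < len(diffs) and diffs[i + 1] == pos + 1
        # Two adjacent changes whose XOR matches the reference are
        # consistent with one substituted nucleotide.
        if adjacent and (ref[pos] ^ ref[pos + 1]) == (read[pos] ^ read[pos + 1]):
            calls.append((pos, "snp_candidate"))
            i += 2
        else:
            calls.append((pos, "likely_error"))
            i += 1
    return calls


reference = "310110101103"
print(classify_color_mismatches(reference, "310110231103"))
# [(6, 'snp_candidate')]
print(classify_color_mismatches(reference, "310110201103"))
# [(6, 'likely_error')]
```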
Disadvantage
The above constraint becomes the primary disadvantage of color space data. Over the years, many analysis tools have been written for nucleotide space sequences. They often do not work in color space. Moreover, at the end of the day, we have to convert the data to nucleotide space, and that may not always be possible. For example, one can perform de novo assembly of an unsequenced region in color space, but how does one then recover the real sequence of that region?