In the previous post, we covered the basics of genetic analysis. The tools discussed there will go a long way to help you follow various scientific discussions involving SARS-CoV-2 genetic data. Today we will quickly review that post, and then look into different “strains” of SARS-CoV-2 coronavirus.
The biology of viral disease progression is as follows. The virus infects a person’s cell, uses the cell’s machinery to replicate, kills the host cell and moves on to the neighboring cells. Meanwhile, the patient’s immune system fights back. Sometimes the human body wins, and other times the virus does.
When the infected person coughs, he spreads droplets carrying viral particles outside his body. If another person inhales active viruses from those droplets, he becomes the new host. In our highly connected world with people flying from one continent to another, it is a matter of time before the virus spreads to every corner of the earth.
We can look at the same process through the prism of genetics. Each viral particle carries its own RNA barcode. When the parent virus “makes babies” by hijacking the cellular machinery of its host, those progeny get the same RNA barcode. However, the copying process may occasionally introduce an error or two. If that happens, and the baby virus with the altered RNA then infects another person, a new RNA barcode starts circulating.
Although the biological mechanism for viral transmission has been known for over a century, only recently did we get the technologies to sequence RNA barcodes on a mass scale and confirm that people in different parts of the world were infected by the same virus. In the previous post, we compared sequences from 29 patients from around the world using MUSCLE, and the viruses appeared almost identical.
Definition of “Strain”
We will use the word “strain” to describe viral particles with different RNA barcodes. A viral genome of length 29000 with four possible nucleotides (A, T, G, C) at each position can have “4 to the power 29000” possible strains. The number is actually far larger, because nucleotides can also be deleted or inserted at different positions of the genome. On the other hand, mutations at some locations prevent the virus from functioning, and that lowers the count somewhat. Separately, we do not know whether different strains are more or less deadly, or more or less infectious, and it is safe to assume that they are functionally equivalent unless evidence shows otherwise.
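To get a feel for the size of that count, here is a small Python sketch. It assumes a genome of exactly 29000 positions and counts substitutions only, so it ignores insertions, deletions and non-viable mutants. The number of distinct sequences of that length over four letters is 4 raised to the power 29000. The function name is made up for this illustration, and the code reports the number of decimal digits instead of printing the number itself.

```python
import math

GENOME_LENGTH = 29000  # approximate genome length used in the text
ALPHABET_SIZE = 4      # A, T, G, C


def digits_in_sequence_count(length: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Number of decimal digits in alphabet**length, computed via logarithms."""
    return math.floor(length * math.log10(alphabet)) + 1


digit_count = digits_in_sequence_count(GENOME_LENGTH)
print(f"4^{GENOME_LENGTH} has {digit_count} decimal digits")

# For comparison, 29000^4 is a small number that fits in 18 digits.
print(f"{GENOME_LENGTH}^4 = {GENOME_LENGTH ** 4}")
```

The logarithm is used so that the huge integer never has to be converted to text.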
Two Primary Strains of SARS-CoV-2
With that introduction, let us look at various strains of SARS-CoV-2 and see what we can learn from them. For patient data, we use the public database at NCBI. At present, it has about 150 complete genomes from all over the world. Each sequence has the location of the patient and time-stamp for sample collection.
Another database called GISAID has over 1300 genomes, but it is private (i.e., password-protected) and the people managing the database are incompetent. I requested access about a month ago and repeatedly tried to contact them by email, but never heard back. Another scientist helped me download some sequences from there through her account, and I will share a summary of what I learned from analyzing that data.
Today we will focus on the two strains discussed in the previous post. Before we begin, here is the convention for giving locations on these genomes. All locations are based on the corresponding position on “NC_045512” after alignment. This was the genome of the Wuhan patient that we used in the previous post. Moreover, in our convention, the first nucleotide of the genome is counted as 1 and so on (not 0, as C or Python programmers would count).
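For readers who do work in Python, the 1-based convention translates to string indexing as in the sketch below. The helper name base_at is hypothetical, and the function assumes the sequence you pass in is already in NC_045512 coordinates (that is, aligned to the reference); it does no alignment itself.

```python
def base_at(genome: str, position: int) -> str:
    """Return the nucleotide at a 1-based position of the genome string."""
    if not 1 <= position <= len(genome):
        raise ValueError(f"position {position} is outside 1..{len(genome)}")
    # Python strings are 0-based, so position 1 lives at index 0.
    return genome[position - 1]


toy_genome = "ACGTTGCA"        # toy sequence, not a real genome
print(base_at(toy_genome, 1))  # first nucleotide of the toy sequence
print(base_at(toy_genome, 8))  # last nucleotide of the toy sequence
```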
Based on this convention, you will find two dominant strains: one has C at location 8782 and T at location 28144, whereas the other has T at location 8782 and C at location 28144. Here is the list of both strains with full patient details for all 1302 samples from GISAID. The number 1100 in the third column means the sample is of the first type (875 samples), and 0011 means it is of the second type (421 samples).
I will later add the code used to get these results on GitHub, but it is conceptually very simple. I load each genome as a string and search it for the short sequences “TTTAGCCAGC” and “CTGTTTACCT”. If it contains both, the genome is of “1100” type. If instead it contains “TTTAGTCAGC” and “CTGTTCACCT”, the genome is of “0011” type. In the Chinese paper discussing these strains, the first one (1100) is called L, and the second one is called S.
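Until that code is posted, the idea can be sketched in a few lines of Python. This is not the code used to produce the tables. The function name is invented for illustration, and it uses plain substring checks, which match these fixed motifs just as a regular expression search would. It assumes each motif occurs where expected and nowhere else in the genome.

```python
from typing import Optional

# Motifs quoted in the text; the variable base sits in the middle of each one.
TYPE_1100_MOTIFS = ("TTTAGCCAGC", "CTGTTTACCT")  # C near 8782, T near 28144
TYPE_0011_MOTIFS = ("TTTAGTCAGC", "CTGTTCACCT")  # T near 8782, C near 28144


def classify_genome(genome: str) -> Optional[str]:
    """Return "1100", "0011", or None when neither motif pair is present."""
    seq = genome.upper()
    if all(motif in seq for motif in TYPE_1100_MOTIFS):
        return "1100"
    if all(motif in seq for motif in TYPE_0011_MOTIFS):
        return "0011"
    return None


# Toy sequence built from the two 1100 motifs plus filler, not a real genome.
toy = "AAAA" + "TTTAGCCAGC" + "GGGG" + "CTGTTTACCT" + "AA"
print(classify_genome(toy))
```

A genome that returns None is worth inspecting by hand, since it may carry ambiguous bases or a mutation inside one of the motifs.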
Illinois Patient - Both Strains?
The virus from one patient in Illinois had none of those sequences. On careful inspection, I noticed that it had “Y” at both of those variable positions, and you can see it in the alignments posted in the previous post. What does that mean?
Y is not a real nucleotide, but assembly programs sometimes use letters like Y, W, etc. to mark ambiguous nucleotides. That means the program could not resolve the base at that position.
Interestingly, the letter Y means T or C. One explanation is that the patient got infected by both strains. Since the sequencing machine sees aggregate information, the assembly tried to resolve the different sequences from the two strains into one genome, and came up with ambiguities at those two places.
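Letters such as Y and W come from the standard IUPAC nucleotide ambiguity codes. The sketch below lists them and scans a sequence for ambiguous positions, reported with the 1-based convention. The function name is made up for this example, and characters outside the table (alignment gaps, for instance) are simply skipped.

```python
from typing import List, Tuple

# IUPAC nucleotide codes: each letter maps to the bases it may stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ambiguous_positions(genome: str) -> List[Tuple[int, str, str]]:
    """List (1-based position, code, possible bases) for every ambiguous base."""
    hits = []
    for index, base in enumerate(genome.upper()):
        options = IUPAC_CODES.get(base, "")
        if len(options) > 1:  # more than one possible base means ambiguous
            hits.append((index + 1, base, options))
    return hits


toy = "TTTAGYCAGC"  # toy fragment with Y at the variable position
print(ambiguous_positions(toy))
```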
Story from Washington State
The GISAID database has 327 sequences from Washington state. I created a different table with their time-stamps, and you can find it here. You will notice that “0011” has been circulating here since Jan. 19, the date of the sample from the first patient, who came from Wuhan. The measurements on Jan. 25 were on the same patient. Then from Feb. 20 onward, we see extensive spread of that strain. Meanwhile, someone brought in the other strain, and it showed up in the measurement on Feb. 27. By Mar. 10, its frequency had increased substantially.
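A table like this can be built by tallying strain types per collection date. Here is a minimal Python sketch, assuming each sample has already been reduced to a (date, strain type) pair with the date written as YYYY-MM-DD so that it sorts correctly as text. The function name and the toy records are invented for illustration; they are not GISAID data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def tally_by_date(samples: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count how many samples of each strain type were collected on each date."""
    counts = defaultdict(Counter)
    for date, strain in samples:
        counts[date][strain] += 1
    # Sort by date string and convert the Counters to plain dicts.
    return {date: dict(counter) for date, counter in sorted(counts.items())}


# Made-up records, only to show the shape of the input and output.
toy_samples = [
    ("2020-01-02", "0011"),
    ("2020-01-01", "1100"),
    ("2020-01-02", "0011"),
    ("2020-01-02", "1100"),
]
for date, by_strain in tally_by_date(toy_samples).items():
    print(date, by_strain)
```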
Origin of Two Strains
Going back to the data for all patients from all countries (and especially Wuhan, China), we see that the 1100 strain was the original one, and all samples collected in Wuhan in Dec. 2019 were of that type. Then the 0011 strain showed up on January 5, 2020, and its frequency increased over time. Samples from Beijing, for example, were all of the 0011 form.
You will find many other patterns by going over the complete dataset with all 1302 patients here. See whether you can reconstruct them by writing your own code to analyze the publicly available full genomes at NCBI.
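As a starting point for that exercise, genomes downloaded in FASTA format can be read into strings with a short parser like the sketch below. It assumes standard FASTA, where each record begins with a line starting with “>” and is followed by one or more sequence lines. The function name is hypothetical, and the toy input stands in for a real download.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading ">") to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # close the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)  # close the last record
    return records


# Toy input; with a real file you would pass an open file handle instead.
toy_fasta = """>sample_A toy record
ACGTAC
ggta
>sample_B toy record
TTGA
"""
for name, sequence in parse_fasta(toy_fasta.splitlines()).items():
    print(name, len(sequence))
```

Each resulting string can then be searched for the short sequences described earlier in the post.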