Over the last few weeks, I received many questions related to genetics of the new coronavirus. Some of them are about genome-based tracking of this virus by the nexttrain team. Others are on claims about two strains of the virus (“L” and “R”), whether the virus is mutating rapidly into more deadly form, how the tests are made, how scientist know that it came from bat or pangolin and finally whether it is a bioweapon.
The last question is worrying a lot of people. The internet is rife with speculations based on comments from “credentialed” experts arguing one way or other. In online discussions, many people seem interested in understanding the science behind such claims instead of trusting the credentials, and therefore I decided to write this series of posts.
In these posts, I will give you the tools so that you can do your own analysis and follow various scientific discussions more closely. You will learn where you can get publicly available genomic data, which free computer programs you can use to perform your analysis and how you can interpret the results. After we cover the basics, we will be ready to discuss competing arguments on bioweapons.
These posts will be at a level that even high-school students can follow. In fact, every summer I teach groups of high-school students the exact bioinformatic techniques. If anything is unclear, please ask me at twitter and I will try to explain my best.
What is a Virus?
Viruses are small particles made of DNA/RNA and proteins. Although they contain similar kind of chemicals as in living cells, viruses cannot “live” on their own. They need to get inside living cells and hijack their machinery to replicate. Then they kill the host cells and spread into neighboring cells to proliferate further.
A New Kind of Microscope
Viruses are tiny and cannot be seen using light microscopes. Even if you get a more powerful microscope, you will not be able to distinguish between two similar types of viruses (e.g. SARS coronavirus versus Wuhan coronavirus). New types of microscopes are needed.
Fortunately DNA sequencing technology advanced rapidly over the last two decades and can work as such powerful microscopes. DNA is a polymeric molecule that can be interpreted as barcode for each virus particle. DNA sequencing allows us to determine the DNA barcode of a collection of viruses. Moreover, when a virus replicates, each “baby” gets a copy of its parent’s DNA. Therefore, we can use the similarities between two DNA sequences to tell how closely related their parents, grandparents or great grandparents were.
Please note that coronaviruses are RNA viruses. That means they carry RNA instead of DNA as their genetic materials, but otherwise the above description remains true.
Download Data from NCBI
That is enough of theory. Let us look into actual data from NCBI, a public repository of DNA sequences maintained by the US government. Here you can see the DNA sequence of Wuhan coronavirus (SARS-CoV-2) obtained from one of the earliest patients in Wuhan, China. The page contains plenty of information, but if you click on the “FASTA” link near the top, you can get to the actual sequence. Here you can get the sequence of the virus that infected the first US confirmed patient, who flew into Seattle from Wuhan during mid-January.
Alignment Program - MUSCLE
We need to find out how different the sequences are, and for that we will use an “alignment” program. MUSCLE is my favorite, because it is fast, accurate, easy to use and, most importantly, works in all major operating systems.
First, copy and paste the viral sequences (including header) from the Wuhan patient (identifier NC_045512) and Seattle patient(identifier MT020881) in a file called “seq.fasta”. I already did the work for you and saved the sequence here in github.
Then, align the sequences by running MUSCLE with the following command.
muscle -in seq.fasta -out seq.aln -clw
You will get this output. Each block shows the results of comparison of portions of the two sequences with each row containing data from a patient. You can see the identifiers NC_045512 and MT020881 representing Wuhan and Seattle patient respectively. If sequences for virus from both patients are identical, the lines are marked with an asterix at the bottom. In this comparison, the sequences are almost identical with hardly one or two differences. However I will give you a secret. Those two patients got infected by two strains of the virus, and you can find the differences in the alignment. Let us revisit this after the next exercise.
Aligning Data from 29 Patients
Next we will align sequences from more patients. Please note that all data and analysis related to this discussion are available in this github repository.
If you found cutting and pasting of sequences from NCBI a bit tedious, there is a better way. NCBI allows bulk download of SARS-CoV-2 sequences from this page. You need to choose the sequences you like and click download to get a file with all FASTA sequences. This page has many more options, and I will describe them in a later post. For the time being, I already downloaded 29 sequences at this link for your convenience.
You can download the raw data file by clicking “Raw” in github. Then use MUSCLE to align in the same way as before. The alignment will look like this file. This time you will notice more difference between the sequencesthan the two-person comparison. Let us check the most interesting ones among them.
In the following images, I copied two selected regions from the entire alignment. In the images, each row (MT020781, MN988713, etc.) displays viral sequences from a different patient. You can learn about the details of those patients from this file. The sequences are mostly identical in all patients, but pay attention to the columns marked with red arrows where the sequences are different.
You notice that the virus from some patients have C in those marked positions and others have T. Howevr, there is something more interesting. You will find that if the first position is C in some patient, the second position is always T and vice versa. None of them have any other combination (C/C, T/T or other nucleotides).
Through this simple analysis that you can run on your own, you discovered two strains of SARS-CoV-2 that the scientists are talking about. Re-examine the Seattle versus Wuhan comparison we did earlier, and you will find that two patients were indeed infected by two different strains.
In the next post, we will go over the other differences in the sequences and their implications.