In NGS experiments, when researchers run into problems with a genome assembly or analysis, they go back to the raw data, that is, the sequencing reads. In a recent preprint submitted to Zenodo, Steven C. Quay did exactly that for a seminal paper and concluded - “The alternative conclusion is that this sample was not a fecal specimen but was contrived. The data cannot, however, distinguish between a non-fecal specimen that came from true field work on the one hand and a specimen created de novo in the laboratory on the other hand.” This is no simple matter, because the entire world has been running around like a headless chicken for the last two years, relying on the genome assembly submitted in that paper.
Our entire understanding of, and response to, the Covid pandemic relied on a number of early papers coming from China. One group of papers discussed the timeline for the origin of the pandemic, and those claims were challenged by a number of researchers. In “Important Covid-related Datasets Disappeared from NCBI SRA”, we wrote about key raw read datasets being erased from NCBI to give the impression that the disease started in December at a Wuhan seafood market. Researcher Bloom recovered some of the erased data from various caches and inferred an earlier start date.
Another set of foundational papers from China gave us the sequences of SARS-CoV-2 and of related viruses claimed to be from bats. That too was a fabrication, according to this new preprint from Quay. Its title says it all - “The seminal paper from the Wuhan Institute of Virology claiming SARS-CoV-2 probably originated in bats appears to contain a contrived specimen, an incomplete and inaccurate genomic assembly, and the signature of laboratory-derived synthetic biology”. Key claims of the new preprint -
1) The RaTG13 specimen was not a bat fecal specimen, based on a comparison of the relative bacterial and eukaryotic genetic material in the purported fecal specimen to nine authentic bat fecal specimens collected in the same field visits as RaTG13 was collected by the Wuhan laboratory, run on the same Illumina instrument (id ST-J00123), and published in a second paper in February 2020. While the authentic bat fecal samples were, as expected, largely bacterial (specifically, 65% bacteria and 12% eukaryotic genetic sequences), the purported RaTG13 specimen had a reversed composition, with mostly eukaryotic genes and almost no bacterial genetic material (0.7% bacteria and 68% eukaryotic). The RaTG13 specimen was also only 0.01% virus genes compared to an average of 1.4% for authentic bat fecal specimens. A Krona analysis identified 3% primate sequences consistent with VERO cell contamination, the standard monkey cell culture used for coronavirus research, including at the Wuhan laboratory. Based on using the mean and standard deviation of the nine authentic bat fecal specimens from the Wuhan laboratory, the probability that RaTG13 came from a true fecal sample but had the composition reported by the Wuhan laboratory is one in thirteen million;
2) According to multiple references, RaTG13 was identified via Sanger dideoxy sequencing before 2016, partially sequenced by amplicon sequencing in 2017 and 2018, and then completely sequenced and assembled by RNA-Seq in 2020, although some reports from WIV suggest the RNA-Seq experiments may have been performed earlier than 2020. In any case, a BLAST analysis of sequences from the amplicon and RNA-Seq experiments indicates an approximate 5% nucleotide difference, 50-fold higher than the technical error rate for RNA-Seq of about 0.1%. At least two gaps of over 60 base-pairs, with no coverage in the RNA-Seq data, were easily identified. The incomplete assembly and anomalous, method-dependent sequence divergence for RaTG13 are troublesome;
3) The pattern of synonymous to non-synonymous (S/NS) sequence differences between RaTG13 and SARS-CoV-2 in a 2201 nucleotide region flanking the S1/S2 junction of the Spike Protein records 112 synonymous mutation differences with only three non-synonymous changes. Based on the S/NS mutational frequencies elsewhere in these two genomes and generally in other coronaviruses, the probability that this mutation pattern arose naturally is approximately one in ten million. A similar pattern of unnatural S/NS substitutions was seen in a 10,818 nt region of the pp1ab gene. This pp1ab gene pattern has a probability of occurring naturally of less than one in 100 billion. A total of four regions of the RaTG13 genome, coding for 7,938 nt and about one-quarter of the entire genome, contain over 200 synonymous mutations without a single non-synonymous mutation. This has a probability of one in 10^17. A possible explanation, the absolute criticality of the specific amino acid sequence in the regions which might make a non-synonymous change non-infective, is ruled out by the rapid appearance of an abundance of non-synonymous mutations in these very regions when examining the over 80,000 human SARS-CoV-2 specimens sequenced to date. An alternative hypothesis, that this arose by codon substitution, is examined. It is demonstrated, by example from a published codon-optimized SARS-CoV-2 Spike Protein experiment, that the anomalous S/NS pattern is precisely the pattern which is produced, by design, when synthetic biology is used and represents a signature of laboratory construction.
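To get a feel for the arithmetic behind claims 2 and 3, here is a minimal Python sketch. It is not Quay's actual method. It simply treats each nucleotide difference as an independent coin flip that comes up "synonymous" with some baseline probability, and asks how likely it is to see at least 112 synonymous changes out of 115. The baseline value `P_SYN_BASELINE` is a hypothetical placeholder chosen for illustration; the preprint derives its own frequencies from the rest of the genomes, and the result is very sensitive to that choice. The function names are mine as well.

```python
from math import comb


def binomial_tail(n: int, k_min: int, p: float) -> float:
    """Return P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def fold_difference(observed: float, baseline: float) -> float:
    """Return how many times larger `observed` is than `baseline`."""
    return observed / baseline


# Counts quoted in claim 3: 112 synonymous and 3 non-synonymous differences.
SYN, NONSYN = 112, 3

# ASSUMPTION: hypothetical fraction of differences expected to be synonymous.
# This number is NOT taken from the preprint; swap in a measured value.
P_SYN_BASELINE = 0.80

if __name__ == "__main__":
    total = SYN + NONSYN
    p_region = binomial_tail(total, SYN, P_SYN_BASELINE)
    print(f"P(at least {SYN} of {total} synonymous) = {p_region:.3g}")

    # Claim 3 also mentions over 200 synonymous changes with none
    # non-synonymous; under the same toy model that is p ** 200.
    print(f"P(200 of 200 synonymous) = {P_SYN_BASELINE ** 200:.3g}")

    # Claim 2: about 5% divergence against about 0.1% technical error rate.
    print(f"fold difference = {fold_difference(0.05, 0.001):.0f}")
```

The independence assumption is the weak point of this toy model, since neighboring sites and codon positions are not independent in real genomes, so treat the output as an order-of-magnitude illustration only.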
Our blog already covered this topic three months back based on two different preprints from Daoyu Zhang (here) and Rahalkar and Rahalkar (here). All those authors came to similar conclusions as Steven Quay.
Another thing to note - these preprints are all being submitted to Zenodo or ResearchGate, because bioRxiv censors papers that do not fall under the party-sanctioned narrative or do not come from “approved” authors. I personally experienced this and covered it in “Biorxiv Fails Spectacularly as a Preprint Server”. Nobody from bioRxiv reached out to me to explain what happened, and therefore I would encourage you to avoid bioRxiv as a preprint server.