Important Covid-related Datasets Disappeared from NCBI SRA

The origin of the SARS coronavirus causing the pandemic is still a mystery due to a paucity of early data. This is puzzling because Wuhan, where the pandemic started, is equipped with world-class virology labs. A recent finding by Jesse Bloom, a virologist at Fred Hutch, suggests that we are likely being deliberately misled. By checking internet caches, he recovered an entire set of early measurements deleted from the NCBI SRA database in March 2020, possibly based on an order from the Chinese government. Incorporating these early measurements points to a progenitor of SARS-CoV2 different from the commonly accepted one.

In January 2021, Sudhir Kumar and colleagues posted a preprint on bioRxiv claiming that the earliest progenitor of SARS-CoV2 differed from the commonly accepted one by three nucleotides. They used “a novel application and advancement of computational methods initially developed to reconstruct the mutational history of tumor cells in a patient”. Their result was surprising, because none of the early patient data released from Wuhan confirmed these variants.

The recent finding by Bloom resolves this mystery. He located a deleted sequencing dataset submitted to NCBI SRA in February 2020. This dataset was deleted from SRA one month later, right around when the Chinese government ordered all researchers to get government approval prior to releasing any data or paper on Covid. This dataset covered Nanopore sequencing of early patients conducted by Wang et al., who also posted a preprint on medRxiv around the same time. Interestingly, the paper by Wang et al. was later published in an “official” journal even though the raw data backing it had disappeared. Luckily, Bloom managed to recover almost all of the sequencing files from Google Cloud caches. He initially could not locate two files, but internet sleuths helped him track them down.

His reanalysis agreed with the observations made by Sudhir Kumar and disputed the often-repeated claim that the Huanan Seafood Market was the origin of the pandemic. Quoting from the paper: “Phylogenetic analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic.” It also opened up a bigger question: why are the Chinese authorities working so hard (going to the extent of deleting files from NCBI SRA) to hide the origin of the virus?
