A new preprint titled “Legacy Data Confounds Genomics Studies” was recently posted on bioRxiv. It shows that researchers using data from the 1000 Genomes Project need to be cautious about garbage-in-garbage-out effects (technical term: batch effects) leading to spurious discoveries.
Rapid Mutation in the Genomes of Japanese Population
In 2017, Harris and Pritchard studied the mutation spectrum of human genomes using data from the 1000 Genomes Project and reported their findings in “Rapid evolution of the human mutation spectrum”. One of their intriguing findings was a heterogeneous mutational signature among closely related Japanese individuals.
In some cases, mutational spectra differ even between very closely related populations. For example, the AC→CC mutations with elevated rates in East Asia appear to be distributed heterogeneously within that group, with most of the load carried by a subset of the Japanese individuals. These individuals also have elevated rates of ACA→AAA and TAT→TTT mutations (Figure 4A and Figure 4—figure supplement 4). This signature appears to be present in only a handful of Chinese individuals, and no Kinh or Dai individuals.
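To make the notation in the quoted passage concrete, here is a minimal Python sketch of how a per-individual mutation spectrum could be tallied from single-nucleotide variants labelled by their triplet context. The function names and the toy variant lists are hypothetical and are not taken from either paper. The sketch also skips steps a real analysis would need, such as collapsing reverse-complement contexts and filtering variants by quality.

```python
from collections import Counter


def mutation_type(ref_context, alt_base):
    """Label a variant by triplet context, e.g. ("ACA", "A") -> "ACA>AAA".

    ref_context is the reference base plus one flanking base on each side;
    alt_base replaces the middle base.
    """
    if len(ref_context) != 3:
        raise ValueError("context must be exactly three bases")
    alt_context = ref_context[0] + alt_base + ref_context[2]
    return f"{ref_context}>{alt_context}"


def mutation_spectrum(variants):
    """Return the fraction of each mutation type among one individual's variants."""
    counts = Counter(mutation_type(ctx, alt) for ctx, alt in variants)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mtype: n / total for mtype, n in counts.items()}


# Toy data (hypothetical): (reference triplet, alternate middle base) pairs.
sample_a = [("ACA", "A"), ("ACA", "A"), ("TAT", "T"), ("GCG", "T")]
sample_b = [("ACA", "A"), ("GCG", "T"), ("GCG", "T"), ("GCG", "T")]

spec_a = mutation_spectrum(sample_a)
spec_b = mutation_spectrum(sample_b)

# Compare the share of one mutation type between the two individuals.
print(spec_a.get("ACA>AAA", 0.0), spec_b.get("ACA>AAA", 0.0))
```

An individual whose share of one mutation type is well above that of closely related individuals is the kind of outlier the passage describes. Whether that excess reflects biology or a sequencing artefact cannot be decided from the spectrum alone.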
They are All “Fake Mutations”
The authors of the new preprint followed up by resequencing genomes from the Japanese population but did not observe the same mutation pattern. They concluded that the observed difference was due to a technical artefact in the earlier data from the 1000 Genomes Project (1kGP). So, just like fake news, we need to worry about fake mutations corrupting research findings.
While we were unable to reproduce the mutational heterogeneity within the Japanese population, we could trace back the source of the discrepancy to a technical artefact in the 1kGP data. In addition to creating biases in mutational signatures, this artefact leads to spurious imputation results which have found their way in a number of recent publications.
Given the central role played by 1kGP data in human genomic research, we wonder how many hyped-up GWAS discoveries need to be revisited. Moreover, other large-scale data sets, such as the exome sequencing data from the TCGA cancer project, are not immune to batch effects. Rasnic et al. recently reported this in “Substantial Batch Effects in TCGA Exome Sequences Undermine Pan-Cancer Analysis of Germline Variants”.