Contamination Nightmare in Microbial Genome Assemblies
While working on RNase P in microbial genomes, I noticed something very puzzling. An archaeal protein that had never been seen in bacteria was present (and even annotated) in a newly sequenced bacterial genome. If true, it could completely change our understanding of the evolution of the RNase P protein families.
More often than not, such unusual observations can be ascribed to contamination and pipeline errors, as Steven Salzberg’s group reported in a recent paper. In "Human contamination in bacterial genomes has created thousands of spurious proteins", they observed:
Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely-used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred on small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
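The small-contig filter suggested at the end of that abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `read_fasta` and `filter_small_contigs` are made up for this post, and the `min_length` default of 1000 bp is an arbitrary placeholder, since the abstract quoted above does not give a cutoff.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    records: Iterable[Record], min_length: int = 1000
) -> Tuple[List[Record], List[Record]]:
    """Split contigs into (kept, dropped) by sequence length.

    min_length is a hypothetical cutoff, not a value taken from the paper.
    """
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            kept.append((header, seq))
        else:
            dropped.append((header, seq))
    return kept, dropped


if __name__ == "__main__":
    # Toy input; a real assembly would be read from a file handle instead.
    demo = [">contig_1", "ACGT" * 3, ">contig_2", "ACG"]
    kept, dropped = filter_small_contigs(read_fasta(demo), min_length=10)
    print("kept:", [h for h, _ in kept])
    print("dropped:", [h for h, _ in dropped])
```

Returning the dropped contigs alongside the kept ones makes it easy to inspect what was removed, for example by searching those short contigs against the human genome before discarding them.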
Human contamination is only one of several sources of error, and it is easier to rectify than archaea being sold as bacteria or vice versa. The biggest nuisance, however, is the propagation of incorrect annotations by automated pipelines. If a protein is incorrectly annotated in an organism that serves as the primary source, and everyone else copies that annotation based on BLAST matches, the error can propagate for years through various pipelines. Moreover, because these pipelines rely on majority voting, the correct function will be outvoted by the numerous erroneous matches. I will show an example in a later blog post, if our readers are interested.
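Until then, the majority-voting problem can be illustrated with a toy sketch. It does not reproduce any particular annotation pipeline: the function name `majority_vote_annotation` and the labels in the example are invented for illustration, and real pipelines typically weight hits by score or identity rather than counting them equally.

```python
from collections import Counter
from typing import Sequence


def majority_vote_annotation(hit_annotations: Sequence[str]) -> str:
    """Return the most common annotation among a query's database hits.

    Hypothetical simplification: every hit counts as one vote.
    """
    if not hit_annotations:
        raise ValueError("no hits to vote on")
    counts = Counter(hit_annotations)
    label, _ = counts.most_common(1)[0]
    return label


if __name__ == "__main__":
    # Made-up scenario: four entries copied the same wrong label from one
    # primary source, and a single entry carries the curated, correct one.
    hits = ["copied erroneous function"] * 4 + ["curated correct function"]
    print(majority_vote_annotation(hits))
```

In this made-up scenario the four copied labels win the vote, so correcting a single database entry changes nothing downstream until the copies are fixed as well.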