A few months back, Eran Elhaik privately shared a preprint of his article on the indiscriminate use of PCA in population genetics. I thought it would challenge many accepted findings in the field. The paper is currently available on bioRxiv as “Why most Principal Component Analyses (PCA) in population genetic studies are wrong”.
Principal Component Analysis (PCA) is a multivariate analysis that allows reduction of the complexity of datasets while preserving data’s covariance and visualizing the information on colorful scatterplots, ideally with only a minimal loss of information. PCA applications are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics), implemented in well-cited packages like EIGENSOFT and PLINK. PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, whereabouts, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We employed an intuitive color-based model alongside human population data for eleven common test cases. We demonstrate that PCA results are artifacts of the data and that they can be easily manipulated to generate desired outcomes. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns on the validity of results reported in the literature of population genetics and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations. An alternative mixed-admixture population genetic model is discussed.
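The abstract's central claim, that PCA results can be manipulated by choices made in assembling the data, is easy to illustrate on toy data. The sketch below is my own minimal example, not Elhaik's actual experiments: four hypothetical "populations" are simulated as Gaussian clouds in a 200-dimensional space standing in for SNP loci, and changing nothing but how many individuals are sampled from each population rotates PC1 onto an entirely different axis of variation.

```python
import numpy as np

rng = np.random.default_rng(42)
N_SNPS = 200  # toy dimensionality, standing in for SNP loci

def simulate_population(center, n, noise=0.3):
    # Toy model (not realistic genotype data): each "population" is a
    # Gaussian cloud around its own center in SNP space.
    return center + noise * rng.standard_normal((n, N_SNPS))

def first_pc(X):
    # Leading principal component via SVD of the centered data matrix.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Four populations: two separated along axis 0, two along axis 1.
centers = np.zeros((4, N_SNPS))
centers[0, 0], centers[1, 0] = 6.0, -6.0
centers[2, 1], centers[3, 1] = 5.0, -5.0

def pc1_for_sampling(sizes):
    X = np.vstack([simulate_population(c, n)
                   for c, n in zip(centers, sizes)])
    return first_pc(X)

pc1_balanced = pc1_for_sampling([250, 250, 250, 250])
pc1_skewed = pc1_for_sampling([20, 20, 480, 480])

# Same populations, different sampling: oversampling populations 2 and 3
# makes their axis of separation dominate, so PC1 swings to a new direction.
print("balanced PC1 weight on axis 0:", abs(pc1_balanced[0]))
print("skewed   PC1 weight on axis 0:", abs(pc1_skewed[0]))
print("overlap between the two PC1s:", abs(pc1_balanced @ pc1_skewed))
```

With balanced sampling PC1 aligns with the populations separated along axis 0; with skewed sampling it aligns with the oversampled pair instead, and the two PC1 directions are nearly orthogonal. Any conclusion read off the PC1/PC2 scatterplot, such as which populations look most differentiated, changes with the sampling scheme rather than the underlying biology.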
I asked Elhaik what he thought of t-SNE and UMAP, two newer dimensionality-reduction methods commonly applied to single-cell data and, increasingly, to population genetics as well. He argued that they would be susceptible to the same artifacts. I was therefore excited to find the new preprint from Lior Pachter’s group, titled “The Specious Art of Single-Cell Genomics”.
Dimensionality reduction is standard practice for filtering noise and identifying relevant dimensions in large-scale data analyses. In biology, single-cell expression studies almost always begin with reduction to two or three dimensions to produce ‘all-in-one’ visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative analysis of cell relationships. However, there is little theoretical support for this practice. We examine the theoretical and practical implications of low-dimensional embedding of single-cell data, and find extensive distortions incurred on the global and local properties of biological patterns relative to the high-dimensional, ambient space. In lieu of this, we propose semi-supervised dimension reduction to higher dimension, and show that such targeted reduction guided by the metadata associated with single-cell experiments provides useful latent space representations for hypothesis-driven biological discovery.
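The "extensive distortions" the abstract refers to are not merely empirical; there is a simple geometric obstruction, which the following sketch of mine illustrates (it is an illustration of the general point, not the paper's own analysis). Ten points can be mutually equidistant in ten dimensions, but at most three points can be mutually equidistant in a plane, so any two-dimensional embedding of them must distort some pairwise distances. Plain PCA is used for the embedding here; nonlinear methods like t-SNE and UMAP face the same counting constraint.

```python
import numpy as np

# Ten points that are all exactly the same distance apart:
# the rows of the identity matrix, i.e. vertices of a regular
# simplex, with pairwise distance sqrt(2) between every pair.
k = 10
points = np.eye(k)

def embed_2d(X):
    # Linear 2D embedding via PCA (SVD of the centered data).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

emb = embed_2d(points)
true_d = pairwise_distances(points)
emb_d = pairwise_distances(emb)
iu = np.triu_indices(k, 1)  # upper triangle: each pair once

# In the original space every pair is equally far apart; in the
# 2D embedding the distances necessarily spread out.
print("true distances: all equal to", true_d[iu].max())
print("embedded distances: min =", emb_d[iu].min(),
      " max =", emb_d[iu].max())
```

In the ambient space every inter-point distance is identical, yet in the 2D picture some pairs appear far closer than others, which is exactly the kind of spurious structure the authors warn is then read as biology.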
What are the alternatives? Elhaik proposes a “mixed-admixture population genetic model”, whereas Chari et al. suggest “semi-supervised dimension reduction to higher dimension”.