Massively Parameterized Statistics

In this article, I will argue that Multi Parameter Statistics, or better yet, Massively Parameterized Statistics (MPS), more accurately describes the application of AI models to biology and medicine. I will also introduce you to a new preprint on DNA sequence modeling that claims to match Evo.

Terms like ML and AI, as used by computer scientists, often downplay math and statistics, because computer scientists want to distinguish their craft from the traditional fields. For everyone else, this discontinuity is misleading. A group of well-respected statisticians wrote two books on what the CS community calls machine learning and renamed the field statistical learning (“The Elements of Statistical Learning” and “An Introduction to Statistical Learning”). They convincingly argue that “machine learning” is a misnomer, and that finding patterns in data is just old-fashioned statistics.

The name artificial intelligence, or AI, is even more misleading when applied to DNA or other biological data. The term was originally coined because the goal was to build models that mimic human speech or human actions. So, when computer scientists try to generate human-like speech from a computer program, the name AI at least matches the goal of the exercise.

Generating new sequences from DNA prompts, as I described yesterday in “Biological Aspects of Evo and Evo2 - Semantic Mining”, does not mimic human speech. The same is true when these models are applied to transcriptome data or any other type of structured biological data. So, the practice deserves a new name that properly describes what is being done.

If there is one thing these models like to highlight, it is how many parameters they have. From the abstract of the Evo paper -

Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution.

Evo 2 takes it a step further -

We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution.
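
To make “billions of parameters” concrete, here is a back-of-envelope sketch in Python of how parameter counts scale with model width and depth, using the standard dense-transformer approximation of roughly 12·d_model² weights per layer plus a token-embedding table. The layer counts and widths below are my own illustrative assumptions, not the published Evo or Evo 2 configurations (Evo's hybrid architecture will deviate from this formula, but the order of magnitude is the point).

```python
# Back-of-envelope parameter count for a dense transformer-style model.
# NOTE: the shapes below are illustrative assumptions, NOT the published Evo configs.

def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Roughly 12 * d_model^2 weights per block (attention + MLP),
    plus an embedding table of vocab_size * d_model entries."""
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# A byte-level DNA model needs only a tiny vocabulary (A, C, G, T plus a few specials).
vocab = 8

for n_layers, d_model in [(32, 4096), (48, 8192)]:   # hypothetical shapes
    total = approx_params(n_layers, d_model, vocab)
    print(f"{n_layers} layers x {d_model} wide ~= {total / 1e9:.1f}B parameters")
    # prints roughly 6.4B and 38.7B, i.e. the same ballpark as the 7B and 40B models
```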

Similarly, the paper “Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale”, which claims to match Evo, writes -

AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species.

In fact, see Table 1 of this latter paper, which compares various models by tokenization length, training context and number of parameters.
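
As a toy illustration of why tokenization length matters in such comparisons, the sketch below counts how many tokens the same DNA string costs under single-nucleotide (byte-level) tokenization versus non-overlapping k-mer tokenization. The 6-mer size is an arbitrary choice for illustration, not any particular model's scheme.

```python
# Toy comparison: token cost of the same sequence under single-nucleotide
# (byte-level) tokenization vs. non-overlapping k-mer tokenization.
# The 6-mer size is an arbitrary illustrative choice, not a specific model's scheme.

def byte_tokens(seq: str) -> list[str]:
    return list(seq)                              # one token per nucleotide

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    return [seq[i:i + k] for i in range(0, len(seq), k)]

seq = "ACGT" * 1000                               # a 4,000-nucleotide toy sequence
print(len(byte_tokens(seq)))                      # 4000 tokens at single-nucleotide resolution
print(len(kmer_tokens(seq)))                      # 667 tokens at 6-mer resolution

# A fixed token budget therefore covers very different spans of DNA depending on
# the tokenizer, which is why the table reports tokenization alongside context length.
```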

Therefore, Massively Parameterized Statistics is the best description of this field. The name also shows continuity with traditional statistics and with the statistical learning commonly used in biology and medicine. “Deep learning” is the other term commonly used in the CS literature, where “deep” refers to “massively parameterized” and “learning” implies statistics. I think my description makes the process more transparent and therefore easier to follow.

Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

I came across this paper posted on bioRxiv. It claims to perform tasks similar to Evo, but with a context length of 4 kb instead of 131 kb, using a transformer-based (BERT-style) encoder model.

Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a short context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through ModelGenerator in https://github.com/genbio-ai/AIDO and on Hugging Face at https://huggingface.co/genbio-ai.
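
Because AIDO.DNA is an encoder-only (BERT-style) model, one natural zero-shot use is masked-nucleotide scoring. Below is a minimal sketch using the Hugging Face transformers API; the model ID and the trust_remote_code flag are my assumptions for illustration, so check the genbio-ai Hugging Face page and the ModelGenerator repository for the actual checkpoint names and loading instructions.

```python
# Minimal masked-nucleotide scoring sketch with an encoder-only DNA model.
# ASSUMPTION: the model ID below is a placeholder; the real checkpoint name,
# tokenizer behavior, and loading flags may differ -- see the genbio-ai pages.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "genbio-ai/AIDO.DNA-7B"   # hypothetical ID, for illustration only
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

seq = "ACGTACGTACGTACGT"
inputs = tok(seq, return_tensors="pt")

# Mask one position and ask the model for a distribution over tokens there.
masked_ids = inputs["input_ids"].clone()
pos = 8
masked_ids[0, pos] = tok.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked_ids,
                   attention_mask=inputs["attention_mask"]).logits

probs = torch.softmax(logits[0, pos], dim=-1)
top = torch.topk(probs, k=4)
for p, idx in zip(top.values, top.indices):
    print(tok.convert_ids_to_tokens(int(idx)), round(float(p), 3))
```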

I will be writing more on it in later posts.


Written by M. //