Discussing the Evo and Evo2 Papers
Two recent papers applying AI-style large language models to DNA sequences are attracting a lot of attention and a bit of controversy.
The first paper, titled Sequence Modeling and Design from Molecular to Genome Scale with Evo, states -
Trained on 2.7M prokaryotic and phage genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods.
The second paper (Genome Modeling and Design Across All Domains of Life with Evo 2) went even further and incorporated sequences from all domains of life. Here is its abstract -
We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions.
Speaking of controversies, a recent New York Times article recommended that the NIH reduce funding for traditional biological research and instead channel the money into large AI labs doing biology.
The Arc Institute, which does not get funding from the N.I.H., just released an artificial intelligence model called Evo 2, which is trained on DNA the way ChatGPT is trained on language. Evo 2 can predict if specific genetic mutations are harmful or help design new gene-editing systems, which could treat disorders including cystic fibrosis.
Projects like Evo 2 are increasingly the future of science, and they require infrastructure at a scale that traditional N.I.H. grants were never designed to support: massive computing clusters, specialized machine learning engineers and multimillion-dollar lab equipment.
The N.I.H. should pioneer a new funding mechanism to support scientific organizations with the flexibility to build that kind of infrastructure.
This suggestion annoyed a number of biologists, who remain skeptical about the broad claims made by the Evo and Evo 2 papers. JHU professor Steven Salzberg wrote -
color me skeptical. Actually, given the incredibly broad and bold claims in this paper, color me extremely skeptical
It is true that media hype alienates many serious people and fuels skepticism. Instead of relying on other people’s opinions, positive or negative, I decided to dig deeper into the papers myself to understand what they are doing and to figure out the limitations of their approach, if any. In the next few blog posts, I will go over the two Evo papers and a number of associated papers in detail. This post introduces key terms, because I assume people outside the AI field may not be familiar with them. Moreover, I will cover a number of digital signal processing techniques used by these papers, which not all CS people may be familiar with. Let us start with the basics.
Key Papers
The primary papers I am looking into are listed below. I am also reading a number of associated papers and will mention them in the related blog posts.
- Genome modeling and design across all domains of life with Evo 2 by Garyk Brixi et al.
- Sequence modeling and design from molecular to genome scale with Evo by Eric Nguyen et al.
- Semantic mining of functional de novo genes from a genomic language model by Aditi T. Merchant et al.
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution by Eric Nguyen et al.
- Hyena Hierarchy: Towards larger convolutional language models by M. Poli et al.
- Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale by Jerome Ku et al.
Key Terms You Should Know
When you hear terms like artificial intelligence (AI) or neural networks (NN), do not think about human neurons or intelligence. In fact, they have no connection at all, despite every neural network book starting with a picture of a human neuron.
Instead, consider AI and NN as large mathematical functions with many parameters. Those parameters are estimated from existing data, and that process is called “training”. The “trained” mathematical function is then used to make predictions about new input data. Conceptually, the process is no different from the linear regression you are already familiar with. A linear regression model can “intelligently” predict the output from the input, provided the system follows a straight line. AI models do the same for non-linear systems.
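To make that analogy concrete, here is a minimal sketch (plain numpy, with made-up numbers) that “trains” a two-parameter linear model by gradient descent and then uses it to predict. An AI model does exactly the same thing, just with a far more flexible non-linear function and many more parameters.

```python
import numpy as np

# Toy data: the "existing data" from which the parameters are estimated.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0                      # this system happens to follow a straight line

# The "model" is just a function with two parameters, w and b.
w, b = 0.0, 0.0

# "Training": adjust the parameters to fit the data (gradient descent on squared error).
learning_rate = 0.01
for _ in range(5000):
    pred = w * x + b
    err = pred - y
    w -= learning_rate * (2 * err * x).mean()
    b -= learning_rate * (2 * err).mean()

# The "trained" function now makes predictions on new input data.
print(round(w, 2), round(b, 2))        # close to 3.0 and 1.0
print(w * 10.0 + b)                    # prediction for an unseen input, x = 10
```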
You will also often come across the term “Turing test of intelligence”. Let me explain what it means by describing my encounter with one of Turing’s early students in a street market in Thailand. He tried to convince me that the fake handbag in his store was indistinguishable from a brand-name handbag, and therefore it was the “real deal”. Well, I made up that story, but you probably get the point. Passing the Turing test of machine intelligence means the software or chatbot is so good at generating fake output that humans cannot tell it is not real. However, it is still as fake as those Thai handbags.
Here are a few other terms you will encounter:
- Model: This refers to the mathematical function I mentioned above along with all its numerous parameters.
- Language Model: This type of model is trained on language or text data. A large language model uses a vast amount of text data to estimate its parameters.
- Foundation Model: A foundation model is a large language model trained on massive, broad datasets so that it can later be adapted to many different tasks. Make sure you have a rich father-in-law before embarking on this journey, because such training efforts are very expensive.
Often the foundation models are made available for free or for a fee. Someone with the right skills can then fine-tune a foundation model for specific tasks, and that is far less expensive. Here is an example. Let us say a company making 3D printers wants to give “ChatGPT”-style access to all of its manuals to its customers. The company can potentially fine-tune a foundation model so that its responses come from the manuals only.
You are possibly wondering how a mathematical model is applied to text. It is very easy. Break the sentences into words, and then assign a number to each word. For example, let us assume that the numbers for “to”, “be”, “or” and “not” are 7, 17, 2 and 34. Then the sentence “to be or not to” becomes (7, 17, 2, 34, 7). In practice, each word gets a collection of numbers in the form of a vector (an embedding), but you get the basic idea.
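Here is a minimal sketch of that word-to-number step, using the same made-up vocabulary as in the example above:

```python
# Made-up vocabulary matching the example in the text.
vocab = {"to": 7, "be": 17, "or": 2, "not": 34}

def encode(sentence):
    """Turn a sentence into the list of numbers the model actually sees."""
    return [vocab[word] for word in sentence.lower().split()]

print(encode("to be or not to"))   # [7, 17, 2, 34, 7]

# In a real model each number is further mapped to a vector (an "embedding"),
# e.g. 7 -> [0.12, -0.95, 0.33, ...], but the idea is the same.
```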
You will also come across the term “generative models”. Such models are trained to take a collection of words (in the form of numbers) and predict the next word(s). The training process tells the model that if the first five numbers are 7, 17, 2, 34, 7 then the next number is 17. That is just the numerical representation of the sentence “To be or not to be”. The parameters of the model are adjusted to ensure that outcome. The same process is repeated with many, many different sentences, and voila, the trained model is ready to write poetry (or, more likely, technical manuals)!!
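A toy illustration of how those training examples are constructed, continuing the made-up vocabulary from above. A real generative model does not store the pairs in a dictionary; it adjusts millions or billions of parameters so that its predicted next number matches the target, but the (context, next token) bookkeeping is the same.

```python
tokens = [7, 17, 2, 34, 7, 17]   # "to be or not to be" as numbers

# Build (context, next-token) training pairs from the sequence.
context_size = 5
pairs = [
    (tuple(tokens[i:i + context_size]), tokens[i + context_size])
    for i in range(len(tokens) - context_size)
]
print(pairs)   # [((7, 17, 2, 34, 7), 17)]

# A trivial stand-in for a "trained model": memorize the pairs.
# A real generative model instead tunes its parameters so that
# model(7, 17, 2, 34, 7) outputs something close to 17.
model = dict(pairs)
print(model[(7, 17, 2, 34, 7)])  # 17
```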
Two more terms you will often come across are “token” and “context”. Token refers to each unit of text that gets replaced by a number. I should also mention that when AI models work on languages, they do not necessarily use whole dictionary words as tokens; they often use sub-words as well.
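If you want to see sub-word tokens in practice, the sketch below uses OpenAI’s tiktoken library (assuming it is installed; any BPE tokenizer would do). The exact split depends on the tokenizer, but a single English word often becomes several tokens.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # a tokenizer used by several GPT models

ids = enc.encode("unbelievable")
print(ids)                                    # a list of integer token ids
print([enc.decode([i]) for i in ids])         # the sub-word pieces the model actually sees
```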
“Context” refers to how many words (or tokens) you look back to predict the next token. In the case of the line from Shakespeare mentioned earlier, we may use only five words (“to be or not to”) to guess the next word, because the English playwright made this sentence famous. Usually that is not the case. For example, the text “can you guess the next …” can be followed by “word” in the context of this post, and by “winner” in a discussion about a baseball game. So, you have to look back over many more words.
Evo and Evo2 Papers
With that introduction, let me explain the key innovation of the Evo and Evo 2 papers. They built large language models trained on massive amounts of DNA sequence. Why is that a big deal? Let me explain.
The authors used each nucleotide of the DNA sequence as a “token”, but in doing so they encountered a technological hurdle. The “attention” mechanism used to model written text can handle a context of only a few thousand nucleotides before the training process becomes too expensive (even with the Nvidia CEO as your father-in-law). Therefore, they developed a novel mathematical approach based on concepts from signal processing and the fast Fourier transform (FFT), and were thus able to extend the context. This key technical innovation made their work feasible.
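The heart of that trick is that a very long convolution over the sequence can be computed with the FFT in roughly O(N log N) time, instead of the O(N²) cost of attention in sequence length. Below is a minimal numpy sketch of FFT-based causal convolution, just to show the scaling idea; it is not the actual Hyena/StripedHyena operator used in Evo, which adds learned filters, gating, and other machinery on top.

```python
import numpy as np

def causal_conv_fft(x, h):
    """Convolve a length-N signal x with a length-N filter h in O(N log N)
    using the FFT, instead of the O(N^2) cost of doing it directly."""
    n = len(x)
    m = 2 * n                        # zero-pad so circular FFT convolution equals linear convolution
    X = np.fft.rfft(x, m)
    H = np.fft.rfft(h, m)
    y = np.fft.irfft(X * H, m)
    return y[:n]                     # keep only the causal outputs (each depends on past inputs only)

# Toy example: a million-token "DNA-like" sequence and a stand-in for a learned long filter.
n = 1_000_000
x = np.random.randn(n)
h = np.exp(-np.arange(n) / 1000.0)

y = causal_conv_fft(x, h)
print(y.shape)                       # (1000000,) -- a million-token context handled in one pass
```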
In future posts, I will delve into the biological claims and the technological/mathematical details.