Biological Aspects of Evo and Evo2 - Semantic Mining

In the last three posts of this series (here, here and here), we covered the mathematical aspects of evo and evo2. Let us now discuss the biological findings from these models. It will take multiple posts to go over these topics.

In the popular versions of generative AI, you give a prompt to the AI tool and it comes up with a response. For example, you may give the machine one or two lines and ask it to write a paragraph or an essay for you. The output can also be an image or an audio clip. Maybe you ask the machine to create a picture of Napoleon wearing the clothing of the Yongle Emperor from the Ming dynasty (or the reverse), and the machine faithfully follows.

What is the equivalent for an AI model trained with DNA sequences, and is it likely to generate anything meaningful? That is the question the evo/evo2 team asked, and they got some interesting results. Both the evo and evo2 papers cover this topic, but the team expanded on it in “Semantic mining of functional de novo genes from a genomic language model”.

Before continuing, let me quickly go over a couple of important members of the evo/evo2 team. Two names you will see often are Patrick Hsu and Brian Hie. Patrick Hsu is a professor at UC Berkeley, who worked on the Bridge RNA paper (“Bridge RNAs direct programmable recombination of target and donor DNA”). Brian Hie is a professor of Chemical Engineering at Stanford. All of these researchers, including some mentioned in my previous posts, are also associated with the Arc Institute, which is an independent organization researching AI models in biology.

Also, I will start using the term Multi-Parameter Statistics or Massive Parameter Statistics (MPS) instead of artificial intelligence (AI) and machine learning (ML) to describe the models. The terms AI and ML are good for marketing, but they are utterly confusing in the scientific sense. These models are just generalized versions of linear regression or fitting a bell curve, but with a massive number of parameters.
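To make the regression analogy concrete, here is a minimal sketch (my own illustration, not taken from the evo papers) of fitting a two-parameter line by gradient descent. The data, learning rate, step count and the function name fit_line are all assumptions chosen for the example. The only point is that the fit nudges parameters to reduce a prediction error, which is the same idea applied to vastly more parameters in the large models.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Here there are two parameters (w and b); a model like Evo has billions, and its "error" is measured on predicting the next nucleotide rather than a y value.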

Getting back to the topic of semantic mining, how do the evo/evo2 team members use the MPS models to generate new proteins? Here is the approach they took in the evo paper. They first generated novel toxin genes by prompting the model with the genomic region around a known toxin-related gene. Then they used each MPS-generated toxin as the prompt to get a suggestion for a matching antitoxin from the model. In the authors' words:

Using a sequential design strategy, we first generate novel toxins by prompting with genomic context around known toxin-antitoxin systems. We then use these Evo-generated toxins as prompts to design their conjugate antitoxin pairs, yielding functional toxin-antitoxin systems containing proteins with remote homology (< 30% sequence identity) to known proteins. We observed that 5 out of 10 designed proteins show antitoxin activity, representing a high experimental success rate of 50%.
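The two-step prompting described above can be sketched in a few lines. This is only an illustration of the control flow under loud assumptions: toy_generate is a hypothetical stand-in that returns random DNA and is NOT the Evo model or its API, and the prompt sequence and gene length are made up. In the real workflow, a genomic language model would produce each sequence conditioned on its prompt.

```python
import random

BASES = "ACGT"


def toy_generate(prompt: str, length: int, seed: int) -> str:
    """Hypothetical placeholder for a genomic language model.

    It returns random DNA. The prompt only influences the random seed,
    whereas a real model would condition its output on the prompt.
    """
    rng = random.Random(f"{seed}:{prompt}")
    return "".join(rng.choice(BASES) for _ in range(length))


def design_pairs(context_prompt: str, n_candidates: int, gene_length: int = 90):
    """Sequential design: genomic context -> toxin, then toxin -> antitoxin."""
    pairs = []
    for i in range(n_candidates):
        # Step 1: prompt with genomic context from a known toxin-antitoxin locus.
        toxin = toy_generate(context_prompt, gene_length, seed=i)
        # Step 2: the generated toxin itself becomes the prompt for its partner.
        antitoxin = toy_generate(toxin, gene_length, seed=i)
        pairs.append((toxin, antitoxin))
    return pairs


# Made-up stand-in for the genomic context around a toxin gene.
CONTEXT_PROMPT = "ATGACCGGTAAACTGCGTGAA"

pairs = design_pairs(CONTEXT_PROMPT, n_candidates=3)
for toxin, antitoxin in pairs:
    print(toxin[:20], "->", antitoxin[:20])
```

Every candidate pair produced this way would still have to be tested in the lab, which is where the 5 out of 10 figure in the quote comes from.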

They repeated the same strategy to find anti-CRISPR proteins, which are expected to inhibit the action of Cas9 in cleaving viral DNA, and found that 14 out of 84 MPS-predicted proteins functioned correctly (a success rate of about 17%). Finally, they generated an entire bacterial genome using MPS. The “Semantic Mining” paper linked above describes the results in further detail. Its abstract follows:

Generative genomics models can design increasingly complex biological systems. However, effectively controlling these models to generate novel sequences with desired functions remains a major challenge. Here, we show that Evo, a 7-billion parameter genomic language model, can perform function-guided design that generalizes beyond natural sequences. By learning semantic relationships across multiple genes, Evo enables a genomic “autocomplete” in which a DNA prompt encoding a desired function instructs the model to generate novel DNA sequences that can be mined for similar functions. We term this process “semantic mining,” which, unlike traditional genome mining, can access a sequence landscape unconstrained by discovered evolutionary innovation. We validate this approach by experimentally testing the activity of generated anti-CRISPR proteins and toxin-antitoxin systems, including de novo genes with no significant homology to any natural protein. Strikingly, in-context protein design with Evo achieves potent activity and high experimental success rates even in the absence of structural hypotheses, known evolutionary conservation, or task-specific fine-tuning. We then use Evo to autocomplete millions of prompts to produce SynGenome, a first-of-its-kind database containing over 120 billion base pairs of AI-generated genomic sequences that enables semantic mining across many possible functions. The semantic mining paradigm enables functional exploration that ventures beyond the observed evolutionary universe.
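The homology claims in these excerpts rest on sequence identity, such as the "< 30% sequence identity" threshold quoted earlier. As a rough illustration, here is a minimal sketch of percent identity between two protein sequences that have already been aligned to the same length. The convention used (identical non-gap columns divided by alignment length) is one of several in use and is my assumption, not necessarily the one the authors applied. The sequences and function names are made up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences.

    Both inputs must already be aligned (same length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def is_remote_homolog(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """True when identity falls below the threshold (30% by default)."""
    return percent_identity(aligned_a, aligned_b) < threshold


# Made-up ten-residue examples, not real proteins.
reference = "MKTAYIAKQR"
close_variant = "MKSAYLAKER"   # differs at 3 of 10 positions
distant_variant = "MLSGWVPNER"  # matches at only 2 of 10 positions

print(percent_identity(reference, close_variant))
print(is_remote_homolog(reference, distant_variant))
```

In practice the alignment itself comes from a dedicated tool, and the identity figure depends on how that tool handles gaps and partial matches.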

Note that their approach does not take any structural information into account, yet it can still predict proteins involved in various tasks. I will discuss their semantic mining paper, including their findings on protein homology and protein-protein interactions, in the next article.
