Evo and Evo2 - Math and Algorithm
In the first post of this series, we covered the basic technical terms of the evo and evo2 papers. We also mentioned the key technological innovation that made their work possible. That led to the question - if they were using the fast Fourier transform (FFT), were they using a convolutional neural network (CNN)? The answer is no. The computer science work done by the Stanford group is quite groundbreaking. Let me go over that in detail.
To start with, I will mention some of the key people involved in this work. You will find two Stanford professors involved in this project, namely Christopher Ré and Stefano Ermon. However, the graduate students are often the brains behind many innovations. I will not mention everyone here, and instead link to the related papers. The names I encountered most frequently were Albert Gu and Tri Dao. After they graduated, new students (Michael Poli, Eric Nguyen) took over. Also, this is just the computer science side of this project, whereas the biology side has several others involved. Lastly, when you read their papers, you will encounter the names of many animals like hippos and hyenas as well. What is going on?
I will start by briefly discussing why their work is innovative, and then briefly cover the individual papers. This topic is a bit computer sciency. If you are from biology, you can skip the next few paragraphs and jump straight to the last section.
In computer science, the efficiency of an algorithm is measured by how its running time grows with the input size N, for example as N² or as N·log(N). For large amounts of data, the latter type of algorithm runs way faster, but coming up with such algorithms is no easy task. The popular algorithm behind the current AI boom uses “attention”. You do not need to know the math, but suffice it to say that it runs as L². The original papers behind evo and evo2 came up with an L·log(L) algorithm, and that means the algorithm can handle a much larger L. L here is the context length, or the number of tokens being looked at to guess the next token. Please check the previous post for an explanation of these terms.
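To make the L² versus L·log(L) difference concrete, here is a small sketch (my own illustration, not code from the papers) comparing a direct convolution, which does O(L²) work like the sliding comparisons in attention, with the same convolution computed through the FFT in O(L·log(L)):

```python
import numpy as np

def fft_convolution(u, kernel):
    # A long convolution computed via the FFT costs O(L log L),
    # versus O(L^2) for the direct sliding-window sum.
    L = len(u)
    n = 2 * L  # zero-pad so the circular FFT convolution matches the linear one
    out = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(kernel, n), n)
    return out[:L]

L = 1024
u = np.random.randn(L)
kern = np.random.randn(L)

direct = np.convolve(u, kern)[:L]   # direct method: O(L^2) operations
fast = fft_convolution(u, kern)     # FFT method: O(L log L) operations

print(np.allclose(direct, fast))    # both methods give the same answer
```

Both routes compute the identical result; only the amount of work differs, and at L in the hundreds of thousands that difference is what makes long-context models feasible.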
I will next mention a few key papers (and animals).
In 2020, Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra and Christopher Ré published HiPPO: Recurrent Memory with Optimal Polynomial Projections, where they came up with a general mathematical framework to describe various deep learning algorithms that retain memory. If “retaining memory” sounds too technical, think of the problem of starting with a set of words and guessing the next word, i.e., what the ChatGPT you are familiar with does. Also, HiPPO here stands for high-order polynomial projection operators.
In 2021, Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra and Christopher Ré worked on Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. This paper continued along the same path to introduce “Linear State Space Layer (LSSL)”.
Quoting the paper's abstract: “Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings.”
The most cited paper of this series (and the one you should read to find all the math in one place) is Efficiently Modeling Long Sequences with Structured State Spaces from Albert Gu, Karan Goel, and Christopher Ré. Those were followed by Hungry Hungry Hippos: Towards Language Modeling with State Space Models, Hyena Hierarchy: Towards Larger Convolutional Language Models and others. One paper not mentioned here came up with an efficient way to do long FFTs on a GPU.
Overall, they borrowed a mathematical model used in control theory, discretized it, and found it to perform as well as or better than many of the commonly used neural network models. Also, they could easily write the math in the form of a convolution.
How is this innovation related to modeling DNA sequences? As I mentioned in the previous post, one challenge encountered by those working on DNA sequences using traditional “attention” style math of GPT is that they could not use longer context, or rather they could not look into hundreds of thousands of nucleotides to guess the next nucleotide. That problem was solved by this algorithmic innovation. In the next posts, I will go over the biological aspects of their work.