Training Approach in Evo and Evo2
In the earlier posts of this series (here, here, here, and here), we covered the mathematical and biological aspects of Evo and Evo2. One important topic we have not covered yet is how these models were trained.
What is training? AI or deep learning models, or as we prefer to call them, "Massively Parameterized Statistical (MPS) models", are mathematical functions with a large number of parameters. Those parameters need to be determined from existing data. Conceptually, this is no different from determining the mean and standard deviation of a Gaussian distribution (bell curve) from data, except that the number of parameters is often in the billions. Over the years, the MPS field has developed a set of standard procedures for training these models, and the Evo/Evo2 papers follow them more or less.
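To make the analogy concrete, here is a toy sketch in NumPy; the numbers and the array sizes are purely illustrative and not taken from the papers.

```python
import numpy as np

# Toy illustration: a two-parameter model fitted to data, next to the parameter
# count of a single layer of a large model.
data = np.random.normal(loc=5.0, scale=2.0, size=10_000)
mu, sigma = data.mean(), data.std()      # two parameters determined from data
print(f"Fitted Gaussian: mean={mu:.2f}, std={sigma:.2f}")

# A single 4096x4096 weight matrix already holds ~16.8 million parameters,
# and modern MPS models stack many such matrices.
layer = np.zeros((4096, 4096))
print(f"Parameters in one weight matrix: {layer.size:,}")
```

The fitting principle is the same in both cases; only the number of parameters and the fitting procedure differ.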
Training Data for Evo
The first question is which data were used to train these models. The Evo paper, which is based on bacterial and archaeal genomes, used a combination of the GTDB (bacteria/archaea), IMG/VR (viruses), and IMG/PR (plasmid) datasets.
The OpenGenome pretraining dataset (S3 for summary statistics) was compiled from three different sources: 1) Bacterial and archaeal genomes from the Genome Taxonomy Database (GTDB) v214.1 (Parks et al., 2015), 2) curated prokaryotic viruses from the IMG/VR v4 database (Camargo et al., 2023), and 3) plasmid sequences from the IMG/PR database (Camargo et al., 2024). For GTDB, representative genomes for each species were retained to reduce data redundancy. For IMG/PR, only one representative per plasmid taxonomic unit (PTU) was kept.
Interestingly, they fine-tuned their base model for CRISPR/Cas systems and IS200/IS605 family DNA transposases. Fine-tuning is the process of starting with an MPS model whose parameters have already been determined from various large datasets, and then readjusting those parameters for a specific task (such as generating CRISPR/Cas sequences).
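Here is a minimal sketch of what a fine-tuning step looks like in PyTorch. The tiny embedding-plus-linear model, the checkpoint file name, and the dummy batch are placeholders I made up; the real Evo model is far larger, but the recipe (load pretrained parameters, keep training on task-specific data) is the same in outline.

```python
import torch
import torch.nn as nn

# Minimal sketch of fine-tuning, assuming a tiny stand-in model and a made-up
# checkpoint file; the real Evo model is far larger, but the recipe is the same.
vocab_size = 4                                  # A, C, G, T
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

# 1) Start from parameters determined during large-scale pretraining.
# model.load_state_dict(torch.load("evo_pretrained.pt"))   # hypothetical checkpoint

# 2) Keep training on a small task-specific dataset (e.g. CRISPR/Cas loci),
#    typically with a smaller learning rate than in pretraining.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
tokens  = torch.randint(0, vocab_size, (8, 128))   # dummy batch of task sequences
targets = torch.randint(0, vocab_size, (8, 128))   # dummy next-token targets

logits = model(tokens)                              # (batch, length, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```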
Training Data for Evo2
Evo2 is trained on (i) an updated GTDB set from 2022 with 28,174 bacterial and archaeal genomes, (ii) 16,704 eukaryotic reference genomes from NCBI, (iii) 41,253 metagenomes and metagenome-assembled genomes, (iv) 33,457 eukaryotic organelle genomes, (v) mRNA and ncRNA transcripts from 4,390 reference genomes from NCBI, (vi) noncoding RNAs, and (vii) eukaryotic promoter sequences.
Training Procedure
For those not familiar with how these MPS models are trained, let me walk you through a quick example. Suppose you are training the model on the E. coli genome. The first step is to choose the context, that is, the size of the text used to predict the next token. These terms were explained in the first post of the series.
Suppose you choose a context of 1,000. That means you take the first 1,000 nucleotides (positions 1-1000) of the linearized E. coli genome as input and the next nucleotide (position 1001) as the expected output, and convert all the text into numbers. The parameters of the model are adjusted using a method called backpropagation so that the given input produces the expected output. Then you repeat the same process with positions 2-1001 of the genome as input and position 1002 as the output, and continue in the same way through the entire genome. You repeat this procedure for all the other genomes in the training set, making multiple passes, until some benchmarks are met.
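Below is a minimal sketch of this loop in PyTorch. Everything here (the toy genome, the tiny model, the hyperparameters) is an illustrative assumption; real training uses far larger models, batches of windows, and predicts every next token in the window at once, but the idea is the same.

```python
import torch
import torch.nn as nn

# Minimal sketch of the sliding-window, next-token training loop described above.
# The toy genome and the tiny model are illustrative stand-ins, not the real thing.
nt_to_id = {"A": 0, "C": 1, "G": 2, "T": 3}
genome = "ATGACCGGTT" * 1_000                 # stand-in for a linearized E. coli genome
ids = torch.tensor([nt_to_id[nt] for nt in genome])

context = 1000                                 # window size used to predict the next token
model = nn.Sequential(nn.Embedding(4, 32), nn.Flatten(), nn.Linear(32 * context, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Slide the window one position at a time: nucleotides i..i+999 predict i+1000.
for i in range(len(ids) - context):
    window = ids[i:i + context].unsqueeze(0)   # input: 1000 nucleotides as integers
    target = ids[i + context].unsqueeze(0)     # expected output: the next nucleotide
    loss = nn.functional.cross_entropy(model(window), target)
    loss.backward()                            # backpropagation adjusts the parameters
    optimizer.step()
    optimizer.zero_grad()
```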
Evo used two context lengths during training: pretraining was first run with a context length of 8,000 nucleotides, and the context was then extended to 131,000 nucleotides.
Like most language models, Evo is pretrained via a next-token prediction objective on raw genome sequences with no explicit supervision or annotations. In order to predict the next token given a sequence of tokens, the model must learn the distribution of the genome data and be aware of the biological sequence motifs found in the collected genomes. Pretraining involves 2 stages: the first stage uses a context length of 8k tokens, while the second context extension stage uses 131k tokens as context. Depending on the downstream task, we select a base model from one of the two stages to finetune on smaller datasets of interest for generation.
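Schematically, the two-stage schedule looks something like the sketch below. The `train_stage` helper is a hypothetical placeholder, and the context lengths are the rounded values quoted above, not exact configuration values.

```python
# Schematic of the two-stage pretraining schedule; `train_stage` is a
# hypothetical placeholder, not real training code.
stages = [
    {"name": "pretraining",       "context_length": 8_000},
    {"name": "context extension", "context_length": 131_000},
]

def train_stage(params, context_length):
    """Hypothetical helper: continue training the same model, now feeding it
    windows of `context_length` tokens, starting from `params`."""
    ...
    return params

params = None                                 # randomly initialized parameters
for stage in stages:
    params = train_stage(params, stage["context_length"])
    # Either checkpoint can later serve as the base model for fine-tuning.
```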
The training procedure for Evo2 is similar.
We train Evo 2 in two phases: a pretraining phase at 8192 token context focused more on functional elements and midtraining phase during which we extend up to 1M token context length with more entire genomes in the data mix. Evo 2 40B’s pretraining stage is further split into two stages, first training at 1024 context for 6.6T tokens before extending to 8192 context for 1.1T tokens. Additionally, we train and release a smaller, Evo 2 1B base at 8192 context length for 1T tokens. For efficiency, Evo 2 is trained using sequence packing.
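The quote mentions sequence packing. Here is a tiny sketch of the general idea (my own illustration, not Evo 2's actual data pipeline): tokenized sequences are concatenated with boundary tokens and sliced into fixed-length windows, so no compute is wasted on padding short sequences.

```python
# Tiny illustration of sequence packing (general technique, not Evo 2's exact
# pipeline): concatenate tokenized sequences with an end-of-sequence marker and
# slice the stream into fixed-length windows.
EOS = 4                                          # marker separating two sequences
window = 7                                       # context length (tiny, for illustration)

sequences = [[0, 1, 2], [3, 3, 0, 1, 2, 2], [1, 0]]   # tokenized sequences of varying length

packed = []
for seq in sequences:
    packed.extend(seq + [EOS])

# A window may span several sequences; the EOS token tells the model where one ends.
batches = [packed[i:i + window] for i in range(0, len(packed) - window + 1, window)]
print(batches)   # [[0, 1, 2, 4, 3, 3, 0], [1, 2, 2, 4, 1, 0, 4]]
```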
Due to the size of the training dataset, the Evo2 team had to perform a lot of preprocessing on the data. I will go through those steps in detail in a later post.