Mathematical Side of AI and Its Applications in Biology

Right now, everyone is either enamored of AI or completely turned off by it. Speaking of being fed up, please check the students at the University of Florida booing at this commencement event. Much of their reaction comes from the intense hype generated by the tech-bros of Silly Con Valley to increase the stock prices of their companies.

Instead of being emotionally swayed by hyped-up projections, let us learn the underlying technology and mathematics and see how they can be applied beneficially. You will find that a large part of this technology is available from standard libraries, and you do not need to be a math or computing genius to use it on new problems in your domain. In these posts, I will discuss various applications of “AI” in biology and related sciences. AI is in quotes here, because we will use the definition from the third layer, whereas most people know AI as the first layer (chatbots built on large language models).

As I wrote before, conceptually “AI” or “deep learning” can be better described as massively parameterized statistics. Think of it as a generalized version of linear regression, but with one key difference. In linear regression, you fit a line through numerical data, whereas in deep learning, you fit curves through non-numerical data, such as text or molecular structures, after first encoding them as numbers.
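
To make the "regression on non-numerical data" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the DNA strings are made up, the encoding (base composition) is one of the simplest possible, and the target (GC fraction) was picked only because a linear model can fit it. Real deep learning models use far richer encodings and many more parameters, but the shape of the problem is the same: encode, then fit.

```python
# Toy sketch: encode non-numerical data (DNA strings) as numbers,
# then fit an ordinary linear model by gradient descent.
ALPHABET = "ACGT"

def composition(seq):
    """Encode a DNA string as the fraction of each base (four numbers)."""
    return [seq.count(base) / len(seq) for base in ALPHABET]

def gc_fraction(seq):
    """Made-up target for illustration: the fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def predict(weights, bias, features):
    """Linear model: bias plus a weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, features))

def fit_linear(xs, ys, lr=0.5, epochs=1000):
    """Least-squares fit by plain gradient descent, starting from zero."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            grad_b += err / n
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
        bias -= lr * grad_b
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical training sequences (not real data).
train = ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT", "AACC", "GGTT", "ACGG"]
xs = [composition(s) for s in train]
ys = [gc_fraction(s) for s in train]
weights, bias = fit_linear(xs, ys)

# Predict for a sequence the model has never seen.
print(predict(weights, bias, composition("GCGCAT")), gc_fraction("GCGCAT"))
```

The two printed numbers should agree closely, because this toy target is exactly linear in the encoded features. For real biological targets that is rarely true, which is where the many parameters and nonlinear layers of deep learning come in.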

With that introduction, let us look at some of the current and past applications. In this post, I will talk about them conceptually and then go deeper with math and coding in future posts.

Protein folding - AlphaFold, HelixFold, etc.

By now, everyone in this field knows about AlphaFold and similar protein-folding models. Let me use their examples to conceptually explain what “AI” is doing. Once again, think about fitting a “regression line” with many protein sequences on the x-axis and their structures on the y-axis. As you can see, the protein sequences on the x-axis are not numbers, and therefore we are not solving the familiar problem from statistics. This is where the mathematics of deep learning helps. Once again, this conceptual understanding is enough to get started, and then you can use libraries like PyTorch for the actual implementation.
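
As a rough illustration of the "sequence on the x-axis, structure on the y-axis" picture, the PyTorch sketch below maps an amino-acid string to one 3D point per residue and fits it to a target. It is a toy under loud assumptions: the class name `ToySequenceToCoords`, the layer sizes, the example sequence, and the random target coordinates are all made up, and the architecture bears no resemblance to AlphaFold or HelixFold. It only shows how non-numerical input becomes tensors that a model can be fitted on.

```python
import torch
from torch import nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Turn an amino-acid string into a tensor of integer tokens."""
    return torch.tensor([AA_TO_INDEX[aa] for aa in seq], dtype=torch.long)

class ToySequenceToCoords(nn.Module):
    """Hypothetical toy model: one (x, y, z) output per residue."""

    def __init__(self, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(AMINO_ACIDS), embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden_dim, 3)

    def forward(self, tokens):
        x = self.embed(tokens)              # (batch, length, embed_dim)
        x = self.conv(x.transpose(1, 2))    # (batch, hidden_dim, length)
        x = torch.relu(x).transpose(1, 2)   # (batch, length, hidden_dim)
        return self.head(x)                 # (batch, length, 3)

torch.manual_seed(0)
model = ToySequenceToCoords()
tokens = encode("MKTAYIAKQR").unsqueeze(0)    # made-up sequence, batch of one
target = torch.randn(1, tokens.shape[1], 3)   # random stand-in, NOT real coordinates

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
initial_loss = loss_fn(model(tokens), target).item()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), target)
    loss.backward()
    optimizer.step()
final_loss = loss_fn(model(tokens), target).item()
print(initial_loss, final_loss)
```

The loss going down is the deep learning counterpart of the regression line getting closer to the data points. Real structure-prediction models differ in almost every detail, but they are trained with the same fit-by-gradient loop.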

Evo - DNA models

Evo is different from the other applications presented here, because its authors introduce new math in their models. We got into some of it in the prior discussions of the Evo and Evo2 papers. I will continue on this topic in the future.

Discussing the Evo and Evo2 papers

Evo and Evo2 - math and algorithm

StripedHyena in Evo and Evo2

Biological aspects of Evo and Evo2 - Semantic mining

Training approach in Evo and Evo2

Finding small-molecule antibiotics

Last year, MIT News reported:

Using generative AI, researchers design compounds that can kill drug-resistant bacteria

With help from artificial intelligence, MIT researchers have designed novel antibiotics that can combat two hard-to-treat infections: drug-resistant Neisseria gonorrhoeae and multi-drug-resistant Staphylococcus aureus (MRSA).

Over the last few months, I have been reading the papers by James Collins’ group and find them quite interesting. The best place to start is this 2020 paper on halicin. The math of their work is fascinating, because they need to fit their models on molecules. However, you do not need to hand-code the molecule handling yourself. Instead, use the Python library RDKit to turn molecules into numerical features and implement the ideas.
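
Here is a minimal sketch of what "fitting models on molecules" can start from, assuming RDKit is installed. It converts a few textbook molecules (chosen by me, not taken from the papers) from SMILES strings into Morgan fingerprints, which are fixed-length 0/1 vectors. Fingerprints are used here as a simple stand-in for the molecular representation; the models in the papers may use a different one.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# Well-known example molecules as SMILES strings (assumed for illustration).
smiles = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
}

# Morgan (circular) fingerprint generator: radius 2, 2048-bit vectors.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

fingerprints = {}
for name, smi in smiles.items():
    mol = Chem.MolFromSmiles(smi)  # returns None if the SMILES cannot be parsed
    if mol is None:
        continue
    fingerprints[name] = generator.GetFingerprint(mol)
    print(name, round(Descriptors.MolWt(mol), 1), fingerprints[name].GetNumOnBits())

# Each fingerprint becomes a plain list of 0/1 numbers that a regression
# or neural-network model can take as input.
features = [list(fp) for fp in fingerprints.values()]

# Tanimoto similarity compares two molecules through their fingerprints.
similarity = DataStructs.TanimotoSimilarity(
    fingerprints["benzene"], fingerprints["aspirin"]
)
print("benzene vs aspirin:", similarity)
```

Once molecules are in this numerical form, the rest is the same kind of fitting as before: pair each feature vector with a measured property, such as whether the compound inhibits bacterial growth, and train a model on the pairs.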

Finding anti-microbial peptides

AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens

Here we introduce AMPlify, an attentive deep learning model for AMP prediction, and demonstrate its utility in prioritizing peptide sequences derived from the Rana [Lithobates] catesbeiana (bullfrog) genome. We tested the bioactivity of our predicted peptides against a panel of bacterial species, including representatives from the World Health Organization’s priority pathogens list. Four of our novel AMPs were active against multiple species of bacteria, including a multi-drug resistant isolate of carbapenemase-producing Escherichia coli.

Mathematically, this is closer to the first example. Once again, you do not need to reinvent the wheel and can instead use the PyTorch library to implement their ideas. Rushil, one of my students, will write about this topic more extensively.
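
To give a feel for what an "attentive" sequence model looks like in PyTorch, here is a small sketch of a peptide classifier with a simple attention layer over sequence positions. It is not the published AMPlify architecture: the class name `ToyAttentionClassifier`, the layer sizes, and the two example peptides are made up for illustration, and the model is untrained, so its output probabilities mean nothing yet.

```python
import torch
from torch import nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is padding

def encode_batch(seqs):
    """Pad peptides to a common length and convert them to integer tokens."""
    max_len = max(len(s) for s in seqs)
    tokens = torch.zeros(len(seqs), max_len, dtype=torch.long)
    for row, seq in enumerate(seqs):
        tokens[row, : len(seq)] = torch.tensor([AA_TO_INDEX[aa] for aa in seq])
    return tokens

class ToyAttentionClassifier(nn.Module):
    """Hypothetical toy model: Bi-LSTM states pooled by attention weights."""

    def __init__(self, embed_dim=16, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(len(AMINO_ACIDS) + 1, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        mask = tokens != 0
        # For brevity the LSTM also runs over padding; a fuller model
        # would pack the sequences first.
        states, _ = self.lstm(self.embed(tokens))    # (batch, length, 2*hidden)
        scores = self.score(states).squeeze(-1)      # (batch, length)
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)       # attention over positions
        pooled = (weights.unsqueeze(-1) * states).sum(dim=1)
        prob = torch.sigmoid(self.out(pooled)).squeeze(-1)
        return prob, weights

torch.manual_seed(0)
model = ToyAttentionClassifier()
model.eval()
peptides = ["GLFKKLLKKL", "AKGG"]    # made-up sequences, not real AMPs
with torch.no_grad():
    probs, attn = model(encode_batch(peptides))
print(probs)   # untrained scores between 0 and 1
print(attn)    # one weight per position; padded positions get zero
```

The attention weights are what make such a model "attentive": after training on labeled peptides, they indicate which positions the model relied on for its prediction.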
