An idiot’s guide to lead optimisation for proteins

An idiot’s guide to lead optimisation for proteins (magnusross.github.io)

176 points by magni121 48 days ago | 16 comments

Oh heck, this is awesome to see on the front page! I wrote the underlying Cradle-1 paper that is being discussed!

I used to work for Cradle and writing this paper was the last thing I did before leaving – on good terms – to found my own startup. :D And we'll 100% be using Cradle for our lead optimization.

(On the off-chance: I'm at PEGS Boston this week chatting all things AI+antibodies, in particular for rare diseases. If this topic is of interest to any other protein+tech geeks here then send me an email, let's grab coffee.)

thadk 45 days ago |

Anyone else read this as "An idiot's guide to Pb optimization for proteins," as in avoiding contaminated dietary protein isolates?

QuercusMax 45 days ago | |

The article at least explains what it means in literally the first sentence, which is a lot better than half the things posted on HN!

softbuilder 45 days ago | |

My first impression was that it was something about sales.

pugio 45 days ago | |

Thank you, yes. I was very confused.

theophrastus 45 days ago |

After spending an entire career working 'by hand' (and with a helluva lot of molecular orbital calculations) on the problem this post is about, I've got to tersely weigh in: there's (still) not enough available data, given the size of protein 'phase space', to hope for a proper covering with one's trained-up linear algebra model. Put another way: you've got to include some physical modeling parameters at some stage, like molecular orbitals [1]; otherwise the 'response curve' will only optimize if one gets quite lucky (which is actually unlucky, as you'll then delude yourself into thinking it's generally applicable, which it isn't). For instance, swap in a carboxylic acid moiety where there was previously an aldehyde, a protein side-chain flips over, and you're in a completely different corner of the energetic 'galaxy'.

[1] e.g. https://proteindf.github.io/
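To put rough numbers on the 'phase space' point, here is a back-of-envelope sketch. It assumes only the 20 genetically coded amino acids (no PTMs) and treats every sequence of a fixed length as a distinct point. The sequence length and dataset size are made-up illustrative values, not figures from the post or the paper.

```python
import math

NUM_AMINO_ACIDS = 20  # assumption: the 20 genetically coded types only, PTMs ignored


def log10_sequence_space(length: int) -> float:
    """log10 of the number of distinct sequences of exactly `length` residues."""
    return length * math.log10(NUM_AMINO_ACIDS)


def log10_coverage(num_measured: int, length: int) -> float:
    """log10 of the fraction of sequence space covered by `num_measured` sequences."""
    return math.log10(num_measured) - log10_sequence_space(length)


# Hypothetical numbers, chosen only for illustration.
length = 100
num_measured = 10**6
print(f"sequence space: ~10^{log10_sequence_space(length):.0f}")
print(f"fraction covered: ~10^{log10_coverage(num_measured, length):.0f}")
```

Working in log10 keeps the numbers readable; the fraction covered is far too small to be a meaningful 'covering' of the whole space under these assumptions.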

phreeza 45 days ago | |

That seems possible for generating completely new proteins.

Do you think it's also the case for lead optimization, where you typically have some measurements around your starting point and you expect the generated candidates to stay in that local neighborhood too?

(Disclaimer: former Cradle employee here)
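As a rough illustration of how much smaller that local neighborhood is, the sketch below counts sequences within a few substitutions of a starting sequence. It assumes substitutions only (no insertions or deletions) and a 20-letter alphabet; the function name and the example length are hypothetical.

```python
from math import comb


def neighbourhood_size(length: int, max_mutations: int, alphabet_size: int = 20) -> int:
    """Count sequences within `max_mutations` substitutions of a starting sequence.

    Assumption: substitutions only, so each mutated position has
    (alphabet_size - 1) alternatives. The starting sequence itself is included.
    """
    return sum(
        comb(length, k) * (alphabet_size - 1) ** k
        for k in range(max_mutations + 1)
    )


# Hypothetical 100-residue starting point.
for k in range(1, 4):
    print(k, neighbourhood_size(100, k))
```

The count still grows quickly with the number of allowed mutations, but it stays many orders of magnitude below the full 20^100 space.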

patrickkidger 45 days ago | | |

Oh hello Thomas, fancy seeing you here :D ex-Cradlers unite!

patrickkidger 45 days ago | |

I'll offer a +1 to the sibling comment here.

Yeah it's totally true you can't build a one-size-fits-all foundation model, the data just isn't there. But also... no-one needs that. It's totally fine to tweak a foundation model for any individual problem, and that's the bulk of what is being described in the linked blog post / in the underlying paper.

FWIW whilst at Cradle we had a lot of doubts going into this. Like, thermostability is clearly evolutionarily correlated so it was always pretty likely that by hook or by crook the models could do that correctly. But, binding? Aggregation? Not at all clear that the same principles should hold. And the exciting finding was that yes, yes they do.

the__alchemist 45 days ago |

It sounds like this is mostly (or exclusively?) operating directly on AA seqs. I wonder what the upper limit of capability is for the intended use case without incorporating the 3D chemistry or spatial reasoning, e.g. classical MD, DFT, etc., like ORCA performs. Of particular interest: does this upper bound (assuming it exists; I suspect it does) preclude its utility in practical protein design/gen?

I speculate Cradle is taking the approach they are, rather than a structural/spatial one, because structural/spatial models don't work very well on big molecules like proteins! (And/or they are too slow, errors accumulate over space, etc.)

BigTTYGothGF 45 days ago |

> amino acids of which there are 20 different types

20 different types coded for, but once you get into PTMs that number goes way up.

dnautics 45 days ago |

How many therapeutic proteins are there that aren't mAbs or ~naturally occurring proteins (insulin, modified insulins, hirudin, Cerezyme, etc.)?

I can think of:

etanercept

toufka 45 days ago | |

A large and rapidly growing number.

The largest commercial classes of multi-domain therapeutic proteins include the CRISPR (and similar) systems that drive gene therapies, and the chimeric antigen receptors (and similar) that drive cell therapies.

But lead optimization there looks different from this page’s efforts.

dnautics 45 days ago | | |

Oh yeah, good point on CRISPR and the antibody chimeras (which etanercept is).

I guess I imagine one of the highest-order obstacles to protein therapeutics to be immunogenicity, which is really hard to design around for a de novo protein.

evalu 45 days ago |

future of protein engineering?