So much happening in ML for Biology and Health at ICLR and MLDD this year and no time to catch up?

As always, we’ve got you covered with a concise summary of ICLR and MLDD 2024 content focused on the buzzing intersection of Bio, Health and AI!

Machine Learning for Drug Discovery (MLDD)

The first event of the week was the Machine Learning for Drug Discovery (MLDD) symposium featuring a stellar lineup of speakers: Lucy Colwell, David Baker, Pascal Notin, Thore Graepel, Smita Krishnaswamy, Shantanu Singh, and Žiga Avsec.

[Recording link]

Generating and Analyzing Molecules via Learnable Geometric Scattering - Smita Krishnaswamy

Smita Krishnaswamy (Yale University) introduced graph scattering synthesis (GRASSY), which enables steered generation of molecules via latent-space interpolation.

Reflections on AI, Rejuvenation and Emergence - Thore Graepel

Thore Graepel (Altos Labs) presented an insightful view on the intriguing parallels between the scientific fields of rejuvenation and AI and between agent-based reinforcement learning and drug discovery with a view on emergence across ML systems and medicine.

Accurate proteome-wide missense variant effect prediction with AlphaMissense - Žiga Avsec

Žiga Avsec (Google DeepMind) introduced us to the exciting world of genetic sequence models with their recent work on accurately predicting the pathogenicity of missense variants using AlphaMissense.

Cell Painting powers next-generation phenotypic drug discovery - Shantanu Singh

Shantanu Singh (Broad Institute of MIT and Harvard) gave us a whirlwind tour through the exciting world of using cell painting datasets for drug discovery with many applications from gene characterisation to measuring complex phenotypes to virtual screening.

[Slides link]

Machine Learning to predict protein function from sequence with therapeutic applications - Lucy Colwell

Lucy Colwell (Google Research and University of Cambridge) showcased outstanding work on (1) better understanding the sequence to function link and using this understanding to design candidate sequences and on (2) combining language and sequences to find proteins via text descriptions of their function.

Hybrid protein language models for fitness prediction and design - Pascal Notin

Pascal Notin (Harvard University) introduced the audience to the state-of-the-art in protein language models, with applications to human variant annotation, viral escape prediction and using the conditional generation capabilities of pLMs to design proteins with specific properties.

De novo protein design - David Baker

David Baker (University of Washington) showcased their remarkable toolkit for protein structure prediction and design, which can generate proteins that bind receptors, nanobodies that bind hotspots, symmetric assemblies, and DNA-binding proteins, with immense potential for medicine.

Keynotes

Machine Learning in Prescient Design's Lab-in-the-Loop Antibody Design - Kyunghyun Cho

In a highly recommended keynote, Kyunghyun Cho (Prescient Design/Genentech) kicks us off with a tour of their lab-in-the-loop for antibodies - showcasing challenges (large design space, multiple objectives), methods (Walk Jump Sampling), performance and the big picture in drug discovery.

The emerging science of benchmarks - Moritz Hardt

In his excellent keynote, Moritz Hardt (MPI for Intelligent Systems) tells us why the machine learning community works - despite not having many formal ‘rules’ - by using quantitative benchmarks with remarkable longevity and external validity (even with noisy labels).

Selected papers

Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation

Tuan Le et al (Pfizer, Freie Universität Berlin) explore the design space of de novo 3D molecule generation methods and develop a new approach to equivariant diffusion that sets a new state-of-the-art.

[Paper link]
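The key constraint such models must satisfy is E(3) equivariance: rotating the input coordinates should rotate the output identically. Below is a minimal, self-contained sketch of that property (illustrative only, not the paper's architecture): atoms are displaced along relative position vectors, weighted by an invariant function of pairwise distances.

```python
import numpy as np

def equivariant_update(x):
    """Toy E(3)-equivariant layer (illustrative): each atom moves along
    relative position vectors, weighted by a rotation-invariant function
    of pairwise distances."""
    diff = x[:, None, :] - x[None, :, :]          # (n, n, 3) relative vectors
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)
    w = np.exp(-dist)                             # invariant weights
    return (w * diff).sum(axis=1)                 # equivariant displacement

# Rotating the input rotates the output identically:
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
print(np.allclose(equivariant_update(x @ q), equivariant_update(x) @ q))  # True
```

Because the weights depend only on distances (which rotations preserve) and the update is a sum of relative vectors (which rotate with the input), equivariance holds by construction.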

SaProt: Protein Language Modeling with Structure-aware Vocabulary

Jin Su et al (Zhejiang University and Westlake University) present SaProt, a protein language model (pLM) with a structure-aware vocabulary: each position is a joint residue-and-structure token (structure states encoded via Foldseek). Trained on 40M sequences, SaProt achieves state-of-the-art results on ProteinGym and ClinVar.

[Paper link]
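The structure-aware vocabulary idea can be sketched in a few lines: pair each residue token with a per-residue structure-state token, giving one joint token per (residue, structure) combination. The alphabets below are illustrative stand-ins (SaProt uses Foldseek's 3Di structural states).

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"      # 20 amino-acid tokens
TDI = "acdefghiklmnpqrstvwy"     # 20 structure states (3Di-style, illustrative)

# Joint vocabulary: one token per (residue, structure) pair -> 400 tokens
vocab = {aa + s: i for i, (aa, s) in enumerate(product(AA, TDI))}

def encode(seq, struct):
    """Encode paired residue/structure strings into joint token ids."""
    assert len(seq) == len(struct)
    return [vocab[a + s] for a, s in zip(seq, struct)]

ids = encode("MKV", "adg")       # three joint tokens, e.g. "Ma", "Kd", "Vg"
```

The joint tokens let a standard masked language model condition on local structure for free, at the cost of a quadratically larger (but still small) vocabulary.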

Protein Discovery with Discrete Walk-Jump Sampling

Frey et al (Prescient Design/Genentech) introduce discrete generative models that combine a learned energy function with a sampling-and-projection scheme (walk-jump sampling). The generated antibodies were validated as functional in the lab with a high hit rate.

[Paper link]
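The walk-jump idea can be illustrated in one dimension with an analytic score (a toy sketch under simplifying assumptions, not the paper's learned model): "walk" runs Langevin MCMC on a noise-smoothed density, and "jump" projects back to clean samples in one denoising step via the Tweedie estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                        # smoothing noise level (illustrative)

# Toy target: standard normal data. The sigma-smoothed density is
# N(0, 1 + sigma^2), so its score is available in closed form.
def smoothed_score(y):
    return -y / (1.0 + sigma**2)

def walk(y, steps=200, step=0.05):
    """'Walk': Langevin MCMC on the smoothed (noisy) density."""
    for _ in range(steps):
        y = y + step * smoothed_score(y) \
              + np.sqrt(2 * step) * rng.standard_normal(np.shape(y))
    return y

def jump(y):
    """'Jump': one-shot denoising, x_hat = y + sigma^2 * score(y) (Tweedie)."""
    return y + sigma**2 * smoothed_score(y)

y = walk(rng.standard_normal())    # sample from the smoothed density
x_hat = jump(y)                    # project to a clean sample
```

Sampling in the smoothed space is easier (the noisy density is better conditioned), and the jump step recovers clean samples without running a full reverse diffusion.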

BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks

Marin et al (University of Copenhagen) introduce BEND, a benchmark for DNA language models built from biologically meaningful tasks defined on the human genome, and use it to evaluate how well the embeddings of current DNA LMs capture genomic features.

[Paper link]

Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View

Haoyue Dai et al (Carnegie Mellon University and Broad Institute of MIT and Harvard) tackle gene regulatory network inference under missing data (common in single-cell RNA-seq studies). They show that deleting samples containing zeros in the conditioning variables allows conditional independence tests to asymptotically recover the true causal relations.

[Paper link]
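The test-wise deletion idea can be demonstrated on a synthetic chain X → Y → Z where Y suffers dropout: a naive partial-correlation test conditioned on the zero-inflated observation wrongly finds X and Z dependent, while dropping the zero samples first recovers the conditional independence. This is a toy sketch of the deletion principle, not the paper's full procedure.

```python
import numpy as np

def residualize(y, X):
    """Residual of y after linear regression on the columns of X."""
    A = np.c_[np.ones(len(y)), X]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def partial_corr(data, i, j, cond, zero_delete=False):
    """Partial correlation of columns i and j given `cond`. With
    zero_delete=True, first drop samples with a zero (dropout) in any
    conditioning variable - the test-wise deletion idea."""
    if zero_delete and cond:
        data = data[np.all(data[:, cond] != 0, axis=1)]
    ri = residualize(data[:, i], data[:, cond])
    rj = residualize(data[:, j], data[:, cond])
    return np.corrcoef(ri, rj)[0, 1]

# Synthetic chain X -> Y -> Z, where Y is observed with random dropout:
rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)
z = y + rng.standard_normal(n)
y_obs = y * (rng.random(n) < 0.5)            # ~50% of Y values zeroed out
data = np.c_[x, y_obs, z]

naive = partial_corr(data, 0, 2, [1])                     # biased away from 0
fixed = partial_corr(data, 0, 2, [1], zero_delete=True)   # close to 0
```

On the retained (nonzero) samples the observed Y equals the true Y, so conditioning on it correctly blocks the X → Y → Z path; conditioning on the zero-inflated version does not.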

Conclusion

In conclusion, the ML in bio and health community continues to make exciting progress:

  • The world of proteins (structure, function and generation/design) continues to receive the largest share of ML research activity
  • With analogues to natural language, more researchers are exploring modelling of biological sequences
  • Causality is (finally) deservedly receiving more research
  • With some years of experience, the community is starting to figure out what has promise and what does not (yet)


DISCLAIMER: The above list is a personal curation that most certainly missed many key contributions (in particular the many excellent workshop & competition contributions!) and is only intended to be a starting point for your own exploration.