
ICLR/MLDD 2024 Health & Bio Conference Review
So much happening in ML for Biology and Health at ICLR and MLDD this year and no time to catch up?
As always, we’ve got you covered with a concise summary of ICLR and MLDD 2024 content focused on the buzzing intersection of Bio, Health and AI!
Machine Learning for Drug Discovery (MLDD)
The first event of the week was the Machine Learning for Drug Discovery (MLDD) symposium featuring a stellar lineup of speakers: Lucy Colwell, David Baker, Pascal Notin, Thore Graepel, Smita Krishnaswamy, Shantanu Singh, and Žiga Avsec.
Generating and Analyzing Molecules via Learnable Geometric Scattering - Smita Krishnaswamy
Smita Krishnaswamy (Yale University) introduced graph scattering synthesis (GRASSY), which permits steered generation of molecules via latent-space interpolation.
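The core idea of steered generation via latent-space interpolation can be sketched in a few lines. Note this is a toy illustration of the general technique, not the actual GRASSY model: `z_a`, `z_b` and the latent dimension are made-up stand-ins for codes produced by a trained scattering encoder.

```python
import numpy as np

def interpolate_latents(z_start, z_end, n_steps):
    """Linearly interpolate between two latent codes."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1 - a) * z_start + a * z_end for a in alphas]

z_a = np.zeros(8)  # toy latent code of molecule A
z_b = np.ones(8)   # toy latent code of molecule B
path = interpolate_latents(z_a, z_b, n_steps=5)
# In a real system each intermediate latent would be decoded
# into a candidate molecule by the trained generator.
print(len(path), path[2][0])  # 5 codes; the midpoint value is 0.5
```

Walking the latent path rather than sampling at random is what makes the generation "steered": intermediate points trade off the properties of the two endpoint molecules.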
Reflections on AI, Rejuvenation and Emergence - Thore Graepel
Thore Graepel (Altos Labs) presented an insightful view of the intriguing parallels between rejuvenation science and AI, and between agent-based reinforcement learning and drug discovery, with a perspective on emergence across ML systems and medicine.
Accurate proteome-wide missense variant effect prediction with AlphaMissense - Žiga Avsec
Žiga Avsec (Google DeepMind) introduced us to the exciting world of genetic sequence models with their recent work on accurately predicting the pathogenicity of missense variants using AlphaMissense.
Cell Painting powers next-generation phenotypic drug discovery - Shantanu Singh
Shantanu Singh (Broad Institute of MIT and Harvard) gave us a whirlwind tour through the exciting world of using Cell Painting datasets for drug discovery, with applications ranging from gene characterisation to measuring complex phenotypes to virtual screening.
Machine Learning to predict protein function from sequence with therapeutic applications - Lucy Colwell
Lucy Colwell (Google Research and University of Cambridge) showcased outstanding work on (1) better understanding the sequence to function link and using this understanding to design candidate sequences and on (2) combining language and sequences to find proteins via text descriptions of their function.
Hybrid protein language models for fitness prediction and design - Pascal Notin
Pascal Notin (Harvard University) introduced the audience to the state-of-the-art in protein language models, with applications to human variant annotation, viral escape prediction, and using the conditional generation capabilities of pLMs to design proteins with specific properties.
De novo protein design - David Baker
David Baker (University of Washington) showcased their remarkable toolkit for protein structure prediction and design, which can generate proteins that bind receptors, nanobodies that bind hotspots, symmetric assemblies, and DNA-binding proteins, with immense potential for medicine.
Keynotes
Machine Learning in Prescient Design's Lab-in-the-Loop Antibody Design - Kyunghyun Cho
In a highly recommended keynote, Kyunghyun Cho (Prescient Design/Genentech) kicks us off with a tour of their lab-in-the-loop approach to antibody design - showcasing challenges (large design space, multiple objectives), methods (Walk-Jump Sampling), performance and the big picture in drug discovery.
The emerging science of benchmarks - Moritz Hardt
In his excellent keynote, Moritz Hardt (MPI for Intelligent Systems) tells us why the machine learning community works - despite not having many formal ‘rules’ - by using quantitative benchmarks with remarkable longevity and external validity (even with noisy labels).
Selected papers
Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation
Tuan Le et al (Pfizer, Freie Universität Berlin) explore the design space of de novo 3D molecule generation methods and develop a new approach to equivariant diffusion that sets a new state-of-the-art.
SaProt: Protein Language Modeling with Structure-aware Vocabulary
Jin Su et al (Zhejiang University and Westlake University) present SaProt, a new protein language model (pLM) with a structure-aware vocabulary, co-trained on residue and structure tokens (encoded via Foldseek). Trained on 40M sequences, SaProt is state-of-the-art on ProteinGym & ClinVar.
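The structure-aware vocabulary idea is simple to sketch: each position becomes the pair (amino acid, structure state), so the token vocabulary is the Cartesian product of the two alphabets. The alphabets, the `#` mask symbol, and the example inputs below are illustrative assumptions, not SaProt's exact tokenizer.

```python
# Toy structure-aware vocabulary in the spirit of SaProt.
AA = "ACDEFGHIKLMNPQRSTVWY#"      # 20 residues + mask/unknown (assumed)
STRUCT = "abcdefghijklmnopqrst#"  # 20 3Di structure states + mask (assumed)

# One combined token per (residue, structure-state) pair.
vocab = {aa + s: i for i, (aa, s) in
         enumerate((aa, s) for aa in AA for s in STRUCT)}

def tokenize(seq, struct):
    """Pair each residue with its structural state and look up the id."""
    return [vocab[aa + s] for aa, s in zip(seq, struct)]

ids = tokenize("MKV", "adf")  # hypothetical 3-residue example
print(len(vocab))             # 441 combined tokens (21 x 21)
```

Because structure enters through the vocabulary itself, a standard masked language model objective can be trained on the combined tokens without architectural changes.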
Protein Discovery with Discrete Walk-Jump Sampling
Frey et al (Prescient Design/Genentech) introduce discrete generative models using a learned energy function & a sampling and projection approach (walk-jump sampling). They generated functional antibodies in the lab with high hit rate.
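The two phases of walk-jump sampling can be illustrated in one dimension: "walk" via Langevin MCMC in the noise-smoothed space, then "jump" back to clean data with a single denoising step. This is a minimal sketch under strong assumptions - the target is a standard normal, so the score of the smoothed density is known in closed form, whereas the paper learns it with a denoising network over discrete sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5              # smoothing noise level (assumed for the toy)
var_y = 1.0 + sigma**2   # smoothed density is N(0, 1 + sigma^2)

def score_y(y):
    """Score d/dy log p_sigma(y) of the smoothed toy target."""
    return -y / var_y

def walk(y, n_steps=200, step=0.05):
    """Langevin MCMC in the smoothed y-space."""
    for _ in range(n_steps):
        noise = rng.standard_normal(np.shape(y))
        y = y + step * score_y(y) + np.sqrt(2 * step) * noise
    return y

def jump(y):
    """Tweedie denoising step: estimate of E[x | y]."""
    return y + sigma**2 * score_y(y)

y = walk(rng.standard_normal(1000))  # walk 1000 chains
x_hat = jump(y)                      # jump to clean samples
```

The appeal of the split is that sampling happens in the smoothed space, where the landscape is easier to traverse, and the jump is a single cheap projection back to data.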
BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks
Marin et al (University of Copenhagen) introduce BEND, a benchmark suite for evaluating DNA language models on a collection of biologically meaningful tasks.
Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View
Haoyue Dai et al (Carnegie Mellon University and Broad Institute of MIT and Harvard) tackle the issue of gene regulatory network inference under missing data (common in single-cell RNAseq studies). They show that deleting samples with zeros in the conditioning variables can asymptotically recover the true causal relations.
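The deletion step itself is easy to sketch: before testing X ⟂ Y | Z, drop the samples where any conditioning variable is zero (a possible dropout), then run the conditional independence test on the remainder. The synthetic count data and the exact-zero criterion below are illustrative assumptions, not the paper's full procedure.

```python
import numpy as np

def delete_zero_conditioners(X, Y, Z):
    """Keep only samples with no zeros in the conditioning set Z."""
    keep = np.all(Z != 0, axis=1)
    return X[keep], Y[keep], Z[keep]

rng = np.random.default_rng(1)
# Toy zero-inflated-ish counts standing in for single-cell expression.
Z = rng.poisson(1.0, size=(200, 2)).astype(float)
X = Z[:, :1] + rng.normal(size=(200, 1))
Y = Z[:, 1:] + rng.normal(size=(200, 1))

Xk, Yk, Zk = delete_zero_conditioners(X, Y, Z)
print(Xk.shape[0] < 200)  # True: rows with zero conditioners were dropped
```

The retained subsample is what a standard conditional independence test (e.g. partial correlation) would then be run on; the paper's result is that this simple deletion is asymptotically sound despite discarding data non-randomly.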
Conclusion
In conclusion, the ML in bio and health community continues to make exciting progress:
- The world of proteins (structure, function and generation/design) continues to receive the largest share of ML research activity
- With analogies to natural language, more researchers are exploring the modelling of biological sequences
- Causality is (finally) deservedly receiving more research
- With some years of experience, the community is starting to figure out what has promise and what does not (yet)
DISCLAIMER: The above list is a personal curation that most certainly missed many key contributions (in particular the many excellent workshop & competition contributions!) and is only intended to be a starting point for your own exploration.