The 2010s were a pivotal time for AI research, in particular for the subfield of reinforcement learning, bringing breakthrough results such as human-level control in Atari 2600 games (2015) and the champion-level Go agent AlphaGo (2016). The combination of game engines as effectively infinite data generators with learning methods that (inefficiently, but eventually) turned that data into better models ushered in a new scale-oriented paradigm for AI research that would, some years later, culminate in the large language models of today. Rich Sutton perhaps best encapsulated this paradigm shift in his Bitter Lesson, arguing in favor of methods that scale with computation, as opposed to methods that attempt to leverage human knowledge.

Today, we similarly find ourselves at an inflection point in AI-driven drug discovery that calls for reflecting on what may carry us forward. Despite the great excitement around AI-driven drug discovery in the 2020s, and the conducive backdrop of advances in computation, methods and data collection, little progress has been made in terms of actual medicines reaching patients1, relative to ambitions to develop hundreds of drugs and solve all disease2. Having seen this field evolve over the last decade, I argue that the lack of transformative impact to date is - at least in part - attributable to a persistent preference among AI-driven drug discovery practitioners for approaching drug discovery as an industrialised, formalised and repeatable exercise akin to a factory floor - when the reality is that drug discovery is nothing like a factory floor.

Drug discovery is, in fact, almost the polar opposite of a factory floor. In an industrial factory, success is determined by how many products you produce and at what cost - this basic equation ensures a high correlation between the amount of effort and work you put in and the eventual economic success you achieve. In drug discovery, by contrast, the best strategy is the one that takes the fewest possible steps that are not directly on the path to an approvable drug product, as each idea you pursue that does not turn into an approved product comes with a large cost. Perhaps counter-intuitively, more effort in drug discovery is in general inversely correlated with economic success (unless that effort directly contributes to raising the quality3 of the eventually approved product). The core reason for this difference in character is that - no matter where you insert your AI models in the drug discovery process - there always exists a more expensive and time-intensive downstream validation step that prevents you from conclusively validating all ideas, culminating in the human clinical studies required by law.

To date, the AI-driven drug discovery field has worked almost exclusively on “hypothesis-generation machines”, where each hypothesis carries very little validation and comes with no understanding of its quality, and which therefore deliver, in the first instance, mainly large potential follow-up costs4. Examples include protein structure prediction methods that generate billions of approximate protein structures and, more recently, foundation models that attempt to predict results for experiments that biologists have largely given up on due to their low signal for therapeutic discovery (e.g., cancer cell lines). The underlying assumption is that the many low-validation ideas would be validated in a not-further-specified downstream process - what is left unsaid is that there is no equally capable validation machine for testing therapeutic ideas at the scale at which the hypothesis-generating machines propose them. Transformative impact, however, can only be achieved with medicines that make it to patients, not with mere hypotheses.

The Bitter Lesson of AI-driven drug discovery is that methods that primarily focus on raising the quantity of hypotheses available to be pursued are bound not to be transformative for drug discovery5,6; instead, we should focus our efforts almost entirely on methods that raise the level of validation, i.e. the quality, of individual hypotheses as much as possible.

Unfortunately, working on hypothesis generation remains extremely attractive to researchers in AI-driven drug discovery: (1) generating more and more ideas day-to-day subjectively feels like progress - after all, you’ve created an enormous and ever-growing database of almost-medicines - and (2) science has trained researchers to value invention over the hard work of validating ideas, leading to a world in which everyone wants to be the idea person and comparatively almost no one wants to do the hard and unrewarding work of taking these ideas all the way. And so, AI-driven drug discovery researchers still largely pursue hypothesis-generation machines and consequently, as a field, we unfortunately continue to make the same mistakes.

For individual researchers, the key takeaway from the Bitter Lesson of AI-driven drug discovery is to resist the urge to add further to the pile of ideas that will never be followed up on, and instead to think about how you can take ideas off the pile and contribute to their validation - ideally all the way until they can be conclusively disproven or (more rarely) proven to be high-quality medicines. For organisations, the key takeaway is to consider more thoroughly how they can systematically raise the quality of the decisions they make, and to build processes and environments that incentivise the hard work of progressing ideas rather than the creation of yet more idea-generation factories that are at odds with the economic realities of drug discovery.

In short, instead of pursuing 100 almost-medicines, AI-driven drug discovery should aim for developing 1 actually approved medicine that helps 100 million people.


1 There is one metric by which AI in drug discovery has been extremely successful: adoption. There is almost no sub-problem of drug discovery left where the state of the art does not include at least some form of AI/machine learning, and experts have by now ubiquitously adopted these tools into their workflows. However, this has so far not translated into an increased number of successfully developed medicines, with FDA approvals remaining stagnant over the last decade, nor has it led to increased interest in investing in early clinical studies to advance therapeutic ideas towards patients (which would be an early leading indicator of success - see note 2 below).

2 Some may claim there has not yet been enough time to determine whether these ideas were in fact transformative for the industry. However, the key leading indicator of overall industry productivity - the number of clinical studies being started - has greatly lagged behind world GDP growth and has even regressed since the pandemic, strongly suggesting that the industry as a whole has not gotten significantly better at surfacing high-quality medicine ideas worthy of investment, despite widespread adoption of AI methods. If AI methods truly enabled a coming abundance of medicines in the next decade, you would expect to see increased clinical study activity years in advance, reflecting the newfound abundance of high-quality ideas and the increased potential for investment. The average new medicine spends more than 10 years in clinical development - no visible increase in trial activity likely means that any transformative impact is more than a decade away.

3 A possible measure of quality for a therapeutic idea is the probability of successful translation into an approved medicine multiplied by its estimated positive clinical impact (patients helped, magnitude of clinical improvement over standard of care). I note that this is a very high bar - despite decades of experience, there is scant evidence for anything being systematically associated with better translation probability beyond human genetics and patient stratification.

4 Some simple napkin math may help illustrate just how important it is to pursue only high-quality ideas in an environment where validation is supremely resource-intensive. If the cost of sufficiently validating a single idea is conservatively US$ 50M for a phase 2 clinical study, and considering only efficacy-related failures, then your hypothesis-generating machine needs to produce at least one actually-efficacious therapeutic hypothesis worth at least US$ 1B for roughly every 20 hypotheses generated. Any less, and the machine costs you more than it benefits you, as the cost of pursuing bad ideas outweighs the value of the successful ones.
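The breakeven arithmetic above can be sketched in a few lines of Python (the US$ 50M trial cost and US$ 1B payoff are the illustrative figures from this note, not real estimates, and the model ignores time value, partial failures and portfolio effects):

```python
# Napkin math: breakeven hit rate for a hypothesis-generation machine.
# Figures are the illustrative numbers from the note, not real estimates.
COST_PER_VALIDATION = 50e6  # US$ 50M per phase 2 study per hypothesis
VALUE_PER_SUCCESS = 1e9     # US$ 1B value of one efficacious medicine

def expected_net_value(hit_rate: float, n_hypotheses: int) -> float:
    """Expected net value of validating n hypotheses at a given success rate."""
    cost = n_hypotheses * COST_PER_VALIDATION
    payoff = hit_rate * n_hypotheses * VALUE_PER_SUCCESS
    return payoff - cost

# Breakeven: the machine must be right once per (value / cost) attempts.
breakeven = COST_PER_VALIDATION / VALUE_PER_SUCCESS
print(f"breakeven hit rate: {breakeven:.0%}")            # 5%, i.e. 1 in 20
print(expected_net_value(hit_rate=0.05, n_hypotheses=20))  # breaks even
print(expected_net_value(hit_rate=0.02, n_hypotheses=20))  # loses money
```

A hit rate below 1 in 20 under these assumptions makes the expected net value negative, which is the sense in which a prolific but low-quality idea generator costs more than it delivers.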

5 This may seem counter-intuitive at first, since having an idea - alongside validating it - is a necessary step on the journey of developing a medicine. However, I argue that therapeutic hypothesis generation has long been saturated and is far out of balance vis-a-vis validation: validation is 99.9% of the work required to develop an approvable medicine, yet hypothesis generation receives more research attention. The empirical evidence that scaled idea generation alone does not transform drug discovery far predates AI in drug discovery. The scientific literature - by now an essentially inexhaustible source of possible therapeutic ideas - is the original “hypothesis-generating machine”, and it has expanded exponentially over recent decades. The ready availability of literature hypotheses at vast scale has unfortunately not translated into an increase in successful medicines making it to patients.

6 Hypothesis generation being saturated is strictly true only if you are primarily exploring hypotheses drawn from the cross product of established drug targets (gene products), drug modalities, and disease definitions - almost all such ideas are already on the pile of previously stated hypotheses in the scientific literature and primarily require validation work. Granted, underexplored blue oceans exist (e.g., non-traditional drug targets outside of the gene products listed in databases), and there hypothesis generation could indeed be more critical.