Machine Learning for Precision Oncology and Drug Design

See the publications

Machine Learning for Precision Oncology and Drug Design (MLPODD)

Our research focuses on the development and application of computational methods to predict and analyse how small organic molecules modulate protein and cell function. These problems can be tackled by generating predictive models from relevant data using machine learning (an approach recently rebranded as AI for Drug Discovery). Within this area, problems of interest include predicting the treatment response of tumours from their molecular profiles for precision oncology, cancer pharmaco-omics modelling for phenotypic drug design, molecular target prediction by bioactivity data mining, and target-based drug design (e.g. structure-based virtual screening guided by highly predictive machine-learning scoring functions).

Precision Oncology

The efficacy of a drug treatment depends strongly on the individual cancer patient. There is hence a great need to investigate computational methods able to predict which patients will respond to a given treatment. Each tumour is often described by many thousands of numerical features (e.g. those coming from cheap and fast molecular profiling technologies, such as RNA-seq or Methyl-Seq). Machine learning can be used to identify which combinations of these gene alterations predict treatment response and can thus guide precision oncology efforts. Unfortunately, the number of tumours of a given cancer type that have been both molecularly profiled and treated with the same drug is generally small (it rarely exceeds 100). Such high-dimensional classification problems are hard, because many algorithms struggle to build classifiers that ignore the thousands of irrelevant features.

We are investigating the integration of feature selection with machine learning algorithms to build classifiers that only make use of a much smaller subset of features (those that are most discriminative). For instance, when systematically analysing a comprehensive in vivo data set (1), we have observed that identifying an optimal subset of features using random forest as the base learner results in predictive models for most cancer types, treatments and profiles. We are also interested in the challenge of how best to interpret a prediction in terms of the selected gene alterations, in order to explain why a specific tumour is sensitive or resistant to the treatment.
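
To illustrate why feature selection has to sit inside the model-building loop, the minimal sketch below repeats the selection within each leave-one-out fold, so that the held-out tumour never influences which features are kept. It is not the pipeline used in (1): a simple mean-difference filter and a nearest-centroid classifier stand in for random-forest-based selection, the data are synthetic (three informative features among 200), and all function names are hypothetical.

```python
import random
from statistics import mean


def top_k_features(X, y, k):
    """Rank features by the absolute difference between class means; keep the top k."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid (on the selected features) is closest."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]

    def squared_distance(centroid):
        return sum((x[j] - c) ** 2 for j, c in zip(features, centroid))

    return min(centroids, key=lambda lab: squared_distance(centroids[lab]))


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy with feature selection redone in every fold."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = top_k_features(X_train, y_train, k)  # held-out sample not used
        hits += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return hits / len(X)


# Synthetic data (an assumption for illustration): 20 "tumours", 200 features,
# of which only features 0, 1 and 2 differ between responders and non-responders.
rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(5.0 * label if j < 3 else 0.0, 1.0) for j in range(200)]
     for label in y]

selected = top_k_features(X, y, 3)
accuracy = loo_accuracy(X, y, 3)
```

Selecting features once on the full data set and only then cross-validating would let the test samples leak into the selection step, which tends to inflate the estimated accuracy when features vastly outnumber samples.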


We have compared the standard approach of identifying single-gene markers to the emerging multi-gene approach of combining multiple gene alterations with machine learning using the same in vitro pharmaco-genomic data (2, 3). We have looked at the same question using in vivo preclinical data (1) and we are currently investigating this issue with in vivo clinical data as well.

All these studies reveal that a higher proportion of cancer type-treatment pairs can be accurately predicted if: 1) multi-gene classifiers are built (especially those integrating feature selection), 2) a larger number of machine learning algorithms is employed, and 3) a larger number of molecular profiles is considered. By systematically comparing single-gene and multi-gene classifiers, we have also found that the characteristically low recall (sensitivity) of a single-gene marker is not an intrinsic limitation of precision oncology, but a result of using a single-feature classifier instead of one that effectively combines multiple gene alterations (1, 3).
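
The recall argument can be made concrete with a toy calculation. The labels below are invented purely for illustration (they are not data from (1) or (3)), and a hand-written OR rule stands in for a learned multi-gene classifier.

```python
def recall(truth, predicted):
    """Fraction of truly sensitive tumours that are predicted sensitive."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp / (tp + fn)


def precision(truth, predicted):
    """Fraction of tumours predicted sensitive that are truly sensitive."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    return tp / (tp + fp)


# Hypothetical cohort of 10 tumours: 1 = sensitive to the drug, 0 = resistant.
sensitive      = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
gene_a_altered = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
gene_b_altered = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# Single-gene marker: predict "sensitive" only when gene A is altered.
single_gene_prediction = gene_a_altered
# Toy multi-gene rule: predict "sensitive" when gene A or gene B is altered.
multi_gene_prediction = [int(a or b) for a, b in zip(gene_a_altered, gene_b_altered)]

print(recall(sensitive, single_gene_prediction))  # 0.4
print(recall(sensitive, multi_gene_prediction))   # 0.8
```

In this toy cohort both rules have a precision of 1.0; the single-gene marker simply misses the sensitive tumours whose sensitivity is associated with a different alteration.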

We are currently investigating the application of the developed tools to clinical pharmaco-omic data sets, such as those from acute myeloid leukaemia and metastatic breast cancer patients.

Drug Design

In addition to research intended to optimise the application of known drugs, there is a constant need to discover new drugs to treat cancer patients who do not respond to first-line treatments, relapse and/or have a poor prognosis with current treatments. This cannot be achieved without a way to identify molecules modulating a specific biological function of a therapeutic target. There is now a range of computational methods able to predict the biological activities of a molecule from ever-increasing volumes of relevant experimental data. For instance, Virtual Screening (VS) methods can be used to search vast libraries of molecules for those likely to be active against the considered target. In practice, these tools have been able to discover drug leads for a wide range of targets and are particularly useful for those targets where High-Throughput Screening (HTS) performs poorly or is not an option (e.g. technically not possible, too expensive or too slow). There are also methods devised for optimising the potency of drug leads as well as predicting their off-targets.

For the scenario where one has a molecule with affinity for the target of interest, we devised a ligand-based VS method named Ultrafast Shape Recognition (USR)(4). USR searches molecular libraries for molecules whose 3D shape is similar to that of this template. This is beneficial because similarly shaped molecules are likely both to hit the same targets as the search template and to have a different chemical scaffold(4). Others have built upon this concept by incorporating the spatial distribution of pharmacophoric properties into the search, as in USRCAT(5). We have recently implemented both tools in the USR-VS(6) webserver to carry out large-scale prospective VS.
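
A minimal sketch of the shape-comparison idea follows, based on the commonly described form of USR: each molecule is encoded by the first three moments of its atomic distance distributions to four reference points, and two molecules are compared through an inverse Manhattan distance between the resulting 12 numbers. This is an illustration and not the USR-VS code; implementations differ in how the second and third moments are normalised (mean, variance and skewness are assumed here), and the coordinates and function names are hypothetical.

```python
import math


def _moments(distances):
    """Mean, variance and skewness of a list of distances."""
    n = len(distances)
    mu = math.fsum(distances) / n
    var = math.fsum((d - mu) ** 2 for d in distances) / n
    if var < 1e-12:  # all atoms equidistant: skewness is undefined, use 0
        return [mu, var, 0.0]
    third = math.fsum((d - mu) ** 3 for d in distances) / n
    return [mu, var, third / var ** 1.5]


def usr_descriptor(coords):
    """12-number shape descriptor from a list of (x, y, z) atomic coordinates."""
    n = len(coords)
    ctd = tuple(math.fsum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptor


def usr_similarity(a, b):
    """Similarity in (0, 1]; 1 means identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


# Hypothetical toy "molecules" (coordinates in angstroms).
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0), (4.5, 0.0, 1.0)]
shifted = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in template]
compact = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

d_template = usr_descriptor(template)
d_shifted = usr_descriptor(shifted)
d_compact = usr_descriptor(compact)
```

Because the descriptor depends only on interatomic distances, no alignment of the two molecules is needed before comparison: `usr_similarity(d_template, d_shifted)` is 1 up to floating-point error, while the elongated template and the compact molecule score lower.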

If a structural model of the protein target is available (e.g. an X-ray crystal structure), structure-based methods such as molecular docking can be used to predict the strength with which a molecule binds the target. Docking is useful to identify new drug leads for a target or to design more potent drug leads. The single most important limitation of docking is in ranking molecules by their predicted binding strength, which is carried out by specialised Scoring Functions (SFs). In this area, we demonstrated(7) the advantages of machine-learning SFs over classical SFs (i.e. those based on a linear combination of features). We showed(8) that a more precise chemical description of the protein-ligand complex does not generally lead to more predictive SFs, as was generally thought. We recently showed(9) that the performance of classical SFs quickly stagnates with increasing training data size, unlike that of machine-learning SFs. When tailored for VS, we have found(10) that machine-learning SFs obtain substantially improved VS performance by training with unusually large sets of inactives.
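
The sketch below shows what "a linear combination of features" can look like in practice. It describes a protein-ligand complex by counts of element pairs in contact, in the spirit of the simple descriptors used by RF-Score, and scores it with a weighted sum. The element lists, the 12 Å cutoff, the weights and the coordinates are all assumptions chosen for illustration; a machine-learning SF would replace `linear_score` with a nonlinear model (e.g. a random forest) trained on measured binding affinities.

```python
import math
from itertools import product

PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")


def contact_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand element pairs closer than `cutoff` angstroms.

    Each atom is an (element, (x, y, z)) tuple. Returns one count per
    (protein element, ligand element) pair, i.e. 4 x 9 = 36 features.
    """
    counts = {pair: 0 for pair in product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS)}
    for (p_el, p_xyz), (l_el, l_xyz) in product(protein_atoms, ligand_atoms):
        if (p_el, l_el) in counts and math.dist(p_xyz, l_xyz) < cutoff:
            counts[(p_el, l_el)] += 1
    return counts


def linear_score(counts, weights, intercept=0.0):
    """Classical-style SF: intercept plus a weighted sum of the features."""
    return intercept + sum(weights.get(pair, 0.0) * n for pair, n in counts.items())


# Hypothetical three-atom "protein" and two-atom "ligand".
protein_atoms = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand_atoms = [("C", (1.0, 1.0, 0.0)), ("O", (2.0, -1.0, 0.0))]

features = contact_counts(protein_atoms, ligand_atoms)
# Made-up weights; in a real SF these would be fitted to experimental data.
toy_weights = {("C", "C"): 0.5, ("N", "O"): -0.25}
score = linear_score(features, toy_weights)
```

With a fixed functional form such as this weighted sum, the only thing additional training data can improve is the estimate of the weights, which is one way to read the stagnation of classical SFs noted above.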

In the best-case scenario, a drug lead with high potency against its intended target is generated at great financial and time cost. Unfortunately, many of these optimised leads turn out not to be cell-active and hence have no therapeutic value. With our collaborators in the UK, we have implemented as a webserver an existing method to predict the cell line growth inhibition induced by a molecule(19). This can be used to position a lead for a cancer type by predicting on which cell lines it would induce stronger growth inhibition. This tool can also be used for phenotypic drug design, where a large library of molecules is searched for those predicted to be more active on a given cancer type. Thereafter, it is desirable to predict the targets of the resulting phenotypic hits. For this purpose, we have developed and validated a target prediction method(11), which is available as a webserver(12). Recently, we have also developed a method to predict the synergy of drugs in inhibiting cancer cell lines(13).


In prospective VS studies, we have observed that USR excels at discovering bioactive molecules with new chemical scaffolds (14–17). Several collaborations are ongoing to discover novel ligands for other targets using USR and USRCAT. We have also used a machine-learning SF (RF-Score) as part of a hierarchical VS protocol, which led to the discovery of a large proportion of inhibitors of an antibacterial target(15). However, unlike RF-Score, RF-Score-VS was devised specifically for VS, which results in substantially better VS results, as discussed in our review(18). We have now initiated collaborations to use machine-learning SFs for prospective VS against several cancer targets. In addition, we are using MolTarPred(12) to predict the targets of some clinical drugs. Our collaborators have experimentally confirmed some of the predicted targets (one of these previously unknown targets binds the drug with a potency of 300 nM).

On the phenotypic drug design side, we have predicted the growth inhibition potencies and pairwise synergies of a large set of clinical drugs on cancer cell lines using the methods in (19) and (13), respectively. Selected predictions are currently being validated in vitro by our collaborators.